Hello everyone, I am Jahangir Alam from CRIM, the Computer Research Institute of Montreal. Today I am going to talk about our work on the analysis of the ABC submission to the NIST SRE 2019 CMN2 and VAST challenges. In this talk I will provide an overview of the ABC submission to NIST SRE 2019 by Brno University of Technology, CRIM Montreal, Phonexia, and Omilia.
This is the outline of my talk. I am going to start with an introduction and a description of the data. Then I will talk about speaker verification on conversational telephone speech, the CMN2 task. After that, I will talk about multimedia speaker verification on VAST, employing audio and face biometric traits. Finally, I will draw my conclusions.
Introduction. In the 2019 edition of NIST SRE there are two tasks. One task is speaker verification on conversational telephone speech (CMN2), where there is a domain mismatch between the train and test settings, mainly due to a difference in languages: the training data are mostly in English, whereas the test data are in Arabic.
The second task is multimedia speaker recognition on VAST (Video Annotation for Speech Technology) data, where the main challenge is the multi-speaker test recordings. There are two subtasks in the VAST task: one is the verification of a speaker on audio, using the audio biometric trait only, and the other is the verification of a speaker employing both audio and face biometric traits.
In this work we present the systems developed by the ABC team to tackle the challenges introduced in both the CMN2 and VAST tasks of NIST SRE 2019, and we also provide some analyses of the results.
Data preparation. The original data used for training the speaker discriminant neural networks are NIST SRE 2004 to 2010, Fisher English, all of Switchboard, and VoxCeleb 1 and 2. Augmented data are created by adding noise from MUSAN, convolving with room impulse responses from OpenSLR, and also applying compression with a GSM codec. Only 500k recordings were selected from the augmented data and added to the original data, both to increase the amount and the diversity of the training data.
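As an illustration of the augmentation step, here is a minimal sketch of additive-noise mixing at a target SNR; the function name and SNR handling are illustrative assumptions, not the actual recipe, which also used room impulse responses and codec compression.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise signal into speech at a target SNR in dB -- a toy
    stand-in for the MUSAN-based part of the augmentation pipeline."""
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```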
After filtering based on a minimum allowed duration, in this case five seconds after voice activity detection, and on a minimum number of utterances per speaker, in this case five utterances per speaker, there are approximately seven thousand speakers in the training data.
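The filtering just described can be sketched as follows; `utts`, the thresholds, and the function name are illustrative assumptions, not the actual data-preparation scripts.

```python
from collections import defaultdict

def filter_training_list(utts, min_dur=5.0, min_utts=5):
    """Keep utterances longer than min_dur seconds (after VAD) and
    speakers with at least min_utts surviving utterances.
    utts maps utterance id -> (speaker id, duration in seconds)."""
    per_spk = defaultdict(list)
    for utt, (spk, dur) in utts.items():
        if dur >= min_dur:
            per_spk[spk].append(utt)
    return {spk: u for spk, u in per_spk.items() if len(u) >= min_utts}
```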
The data used for backend training are from NIST SRE 2004 to 2010, comprising approximately 66,000 recordings. The adaptation set is based on the SRE 2018 dev set along with 60% of the SRE 18 eval set; in total there are about a thousand recordings from 137 speakers. Part of the eval set and part of the adaptation set, together with the SRE 18 unlabeled data, were used for score normalization, and as a development test set we used the remaining 40% of the SRE 18 eval data.
Feature extraction. As local features we use 40-dimensional filterbank or 23-dimensional MFCC features, extracted with 25 millisecond windows and a frame shift of 10 milliseconds. For feature normalization, short-term cepstral mean normalization is used with a sliding window of three seconds, and non-speech frames are removed with an energy-based voice activity detector.
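A minimal sketch of short-term cepstral mean normalization over a sliding window (assuming a 10 ms frame shift, so 300 frames correspond to the three-second window mentioned above); this is an illustration, not the toolkit code used in the submission.

```python
import numpy as np

def sliding_cmn(feats, window=300):
    """Subtract the local mean computed over a sliding window of frames.
    feats has shape (num_frames, feat_dim)."""
    out = np.empty_like(feats)
    half = window // 2
    for t in range(len(feats)):
        lo, hi = max(0, t - half), min(len(feats), t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```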
The general pipeline that has been adopted for speaker verification on the CMN2 task is as follows. The current trend in speaker verification is to use deep speaker embeddings with a PLDA backend, where the speaker embeddings are extracted using a speaker discriminant neural network. Such a network is normally trained to discriminate among a set of training speakers and is supervised by some variant of a classification loss, such as softmax, or by a metric learning loss function. In this case, for the CMN2 task, we use speaker discriminant neural networks trained with four different architectures. As backend we use either Gaussian PLDA or a heavy-tailed PLDA model. Evaluation embeddings are centered using the mean of the adaptation set, whereas the backend training set embeddings are centered using the mean of that same training set. Training embeddings are adapted to the target domain using feature distribution adaptation (FDA), and finally we apply unsupervised PLDA adaptation to the PLDA model, which was trained on unadapted speaker embeddings.
Adaptive symmetric score normalization (adaptive S-norm) is also used.
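To make the embedding post-processing concrete, here is a rough sketch of mean centering followed by adaptive S-norm; the cohort size and the assumption of length-normalized embeddings (so that a dot product is a cosine score) are illustrative choices, not the submission's exact configuration.

```python
import numpy as np

def center(emb, mean):
    """Center an embedding with a domain-specific mean (e.g., the
    adaptation-set mean for evaluation embeddings)."""
    return emb - mean

def adaptive_snorm(score, e_emb, t_emb, cohort, topk=200):
    """Adaptive S-norm: normalize a trial score with statistics of the
    top-k most competitive cohort scores on each side."""
    ce = np.sort(cohort @ e_emb)[-topk:]   # cohort vs. enrollment
    ct = np.sort(cohort @ t_emb)[-topk:]   # cohort vs. test
    return 0.5 * ((score - ce.mean()) / ce.std()
                  + (score - ct.mean()) / ct.std())
```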
Now I will give an overview of each individual system for the CMN2 task.
System 1 uses a standard 50-layer ResNet architecture for training the speaker discriminant neural network on filterbank features, and Gaussian PLDA is used for scoring. Two-dimensional convolution is applied over the filterbank features, and to obtain a global representation from the local features, statistics pooling is used.
In this system, for training the PLDA model, additional training data are used from the SRE 2006 to 2010 evaluation data, containing ten thousand recordings from around two thousand speakers.
Post-processing is employed as specified previously in the general pipeline.
System 2 employs an x-vector TDNN architecture for training the speaker discriminant neural network. The Kaldi SRE16 recipe was used in this case, and the network was trained for six epochs. As backend, heavy-tailed PLDA is used, following the general pipeline that has been mentioned before.
For system 3, the architecture selected to train the speaker discriminant neural network is an extended TDNN architecture with a few residual connections between its layers, and the network was trained for two epochs. In this case the extracted embeddings are 768-dimensional instead of 512, and the embeddings are denoised using a denoising autoencoder. One-dimensional convolution is used over the MFCC features, and statistics pooling is used for generating a global utterance-level representation.
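Statistics pooling, as used in these embedding extractors, can be sketched in a few lines; this shows the generic operation, not the exact layer from the submission.

```python
import torch

def statistics_pooling(h, eps=1e-9):
    """Pool frame-level activations of shape (batch, frames, dim) into one
    utterance-level vector by concatenating mean and standard deviation."""
    mean = h.mean(dim=1)
    std = torch.sqrt(h.var(dim=1, unbiased=False) + eps)
    return torch.cat([mean, std], dim=1)
```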
Similarly, as backend, heavy-tailed PLDA is used, following the general pipeline that has been mentioned before.
Finally, for system 4, a TDNN architecture similar to the one used in system 2 is used for training the speaker discriminant neural network, but this network was trained only on the SRE 2004 to 2010 English data, and MFCC features are used as the front-end features. A standard domain adversarial neural network is used on top of the encoder, mainly to discriminate between the source and target domains: the source domain here is English and the target domain is Arabic.
The extracted embeddings in this case are 768-dimensional, and as backend heavy-tailed PLDA is used, following the general pipeline mentioned before.
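The domain adversarial setup typically relies on a gradient reversal layer between the encoder and the domain classifier; here is a minimal PyTorch sketch of that layer, as an illustration of the idea rather than the submission's code.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient in
    the backward pass, so the encoder learns domain-invariant features
    while the domain classifier tries to separate English from Arabic."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```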
Calibration and fusion. For the CMN2 task, calibration and fusion were trained with logistic regression on the development set. Consistent performance was observed across the progress and eval sets, which indicates that we achieved almost perfect calibration.
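A linear logistic-regression fusion of system scores can be sketched as below; the helper name and the use of scikit-learn are assumptions for illustration (dedicated calibration toolkits are commonly used for this step).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels):
    """dev_scores: (num_trials, num_systems) raw scores on the dev set;
    dev_labels: 1 for target trials, 0 for non-target trials.
    Returns a function mapping raw scores to fused, calibrated scores."""
    lr = LogisticRegression()
    lr.fit(dev_scores, dev_labels)
    w, b = lr.coef_[0], lr.intercept_[0]
    return lambda scores: scores @ w + b  # weighted sum plus offset
```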
Table 1 presents the results of the individual and fused systems on the dev and eval sets for the CMN2 task. The single best system we found here is the x-vector TDNN with the heavy-tailed PLDA backend. The denoising autoencoder did not help on its own, but when fused with the other systems it resulted in a nice improvement in performance. The fusion of all systems provided the best performance in this case.
In Table 2 we present and compare system performance using different backends with the ResNet and x-vector TDNN architectures. For the CMN2 task, PLDA backends clearly win; this is perhaps due to the domain mismatch between the train and test settings.
In Table 3 we show the performance when various post-processing steps are applied to the extracted speaker embeddings. From this we can see that when mean centering, feature distribution adaptation, unsupervised PLDA adaptation, and adaptive S-norm are applied in combination to the extracted embeddings, we obtain the best performance.
Now let us move to the VAST task. Data preparation.
The original data used in this case for training the speaker discriminant neural networks are mainly the VoxCeleb 2 development data, which contain around six thousand speakers; but for the TDNN systems, VoxCeleb 1 and 2 and LibriSpeech data combined, which consist of around eleven thousand speakers, are used for training.
Augmented data are created by using noise from MUSAN and room impulse responses from OpenSLR, and around five million recordings from the augmented data were selected and added to the original data, in order to increase the amount and diversity of the training data.
After filtering based on a minimum allowed duration, in this case four seconds after voice activity detection, and on a minimum number of utterances per speaker, in this case eight utterances per speaker, there are approximately six thousand speakers in the training data. The data used for backend training are 145 thousand utterances from the original training data.
The adaptation set is based on 37 utterances from the SRE 18 VAST dev data. A subset of the PLDA training data is used as the cohort for score normalization with S-norm.
The development test set chosen for the audio-only subtask is the SRE 18 VAST eval set, whereas for the audio-visual subtask the development test set is the SRE 19 audio-visual development set.
Feature extraction. For the VAST task, as local features we use 40-dimensional filterbank or 23-dimensional PLP features, extracted with a 25 millisecond window and a frame shift of 10 milliseconds. For feature normalization we use short-term cepstral mean normalization with a sliding window of two seconds, and non-speech frames are removed using an energy-based voice activity detector.
For the VAST audio-only subtask, the general pipeline is as follows. We use speaker discriminant neural networks trained with three different architectures in order to extract the speaker embeddings. As backend we use Gaussian PLDA or cosine scoring. Enrollment and test embeddings are centered using the mean of the backend training set, and training embeddings are adapted to the target domain using feature distribution adaptation. Diarization is applied on the test set, and the final score is the maximum over the per-speaker diarization scores; the score is then normalized using S-norm.
Now the individual systems. For the VAST audio-only subtask we have three single systems.
System 1 uses a standard ResNet architecture, which is first pretrained using the softmax loss and later fine-tuned using the additive angular margin loss function. In this case filterbank features are used as local features, and as backend Gaussian PLDA or cosine scoring is used. For post-processing we follow the general pipeline mentioned before.
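For the fine-tuning stage, an additive angular margin (ArcFace-style) head looks roughly like this in PyTorch; the margin and scale values are illustrative, not the ones used in the submission.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin classification head over speaker classes."""
    def __init__(self, emb_dim, n_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # cosine similarity between normalized embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin only for the target class of each sample
        target = F.one_hot(labels, cos.size(1)).bool()
        cos_m = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * cos_m, labels)
```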
System 2 in this case uses a TDNN architecture for training the speaker discriminant neural network, and this network is trained using the Kaldi SRE16 recipe on VoxCeleb 1 and 2 and LibriSpeech data for six epochs. As backend, a Gaussian PLDA model is used, following the general pipeline that has been mentioned before.
System 3 is trained following the Kaldi x-vector recipe on the SRE 2004 to 2010 and all Switchboard data for two epochs. As front-end features, PLP is used. Augmented SRE 2004 to 2010 data were used for training the backend model. Correlation alignment (CORAL) based domain adaptation is used for adapting the source domain to the target domain in this case. As backend, Gaussian PLDA is used, and for system 3 no score normalization was used.
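Correlation alignment can be sketched as whitening the source-domain embeddings and re-coloring them with the target-domain covariance; the regularization and the extra mean alignment here are illustrative choices, not necessarily the submission's configuration.

```python
import numpy as np
from scipy.linalg import sqrtm

def coral(source, target):
    """Align source-domain embeddings (n_src, dim) to the target domain
    (n_tgt, dim) by matching second-order statistics."""
    d = source.shape[1]
    cs = np.cov(source, rowvar=False) + np.eye(d)  # regularized covariances
    ct = np.cov(target, rowvar=False) + np.eye(d)
    whiten = np.linalg.inv(sqrtm(cs)).real
    recolor = sqrtm(ct).real
    return (source - source.mean(0)) @ whiten @ recolor + target.mean(0)
```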
Since the test data contain multi-speaker recordings, we adopted speaker diarization to obtain the number of speakers and to cluster the speech segments according to speaker identity. For each test utterance we extract an x-vector every 250 milliseconds; then agglomerative hierarchical clustering is used to cluster the embeddings into one, two, three, or four speaker clusters, and an embedding is obtained for each detected speaker. The enrollment embedding is scored against all test embeddings, and the final score is the maximum obtained score.
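The diarization-plus-scoring step can be sketched as follows with SciPy's agglomerative clustering; the clustering settings and helper names are illustrative, and the real system compares cluster counts from one to four.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def diarize_and_score(enroll_emb, test_embs, n_speakers=2):
    """Cluster sliding-window test embeddings (e.g., one every 250 ms) into
    n_speakers clusters, average each cluster, and return the maximum
    cosine score between the enrollment embedding and the clusters."""
    z = linkage(test_embs, method="average", metric="cosine")
    labels = fcluster(z, t=n_speakers, criterion="maxclust")
    spk_embs = [test_embs[labels == k].mean(axis=0) for k in np.unique(labels)]

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(cosine(enroll_emb, e) for e in spk_embs)
```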
Now let us move on to the visual systems for the VAST task, starting with visual system 1.
Visual system 1 uses a pretrained squeeze-and-excitation version of ResNet-50, trained on the VGGFace2 dataset, and this pretrained network is used to extract face embeddings. For the enrollment data, based on the provided frame indices and face bounding boxes, only the face regions are cropped and normalized before being passed to the pretrained model for embedding extraction. A speaker is represented by averaging the enrollment embeddings, whereas a single-shot detector based face detection tool is used to detect one face per second in the test data.
For scoring, cosine similarity is computed between the enrollment and test embeddings, and the maximum score is selected. No score normalization is applied for any of the visual systems.
Visual system 2, similar to visual system 1, also uses the pretrained squeeze-and-excitation ResNet trained on the VGGFace2 dataset to extract face embeddings, but for this system, at each frame multiple bounding boxes are extracted using MTCNN. Kalman filtering is applied to track the extracted bounding boxes from frame to frame. The Chinese whispers algorithm is applied for clustering, and this algorithm does not use any prior information about the number of clusters.
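Chinese whispers clustering, which indeed needs no preset number of clusters, can be sketched as below; the similarity threshold and iteration count are illustrative assumptions.

```python
import numpy as np

def chinese_whispers(embeddings, threshold=0.5, iterations=20, seed=0):
    """Graph-based clustering of L2-normalized face embeddings: nodes adopt
    the label that is strongest among their neighbors until convergence."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    sims = embeddings @ embeddings.T
    adj = sims > threshold          # connect sufficiently similar faces
    np.fill_diagonal(adj, False)
    labels = np.arange(n)           # every node starts as its own cluster
    for _ in range(iterations):
        for i in rng.permutation(n):
            neigh = np.flatnonzero(adj[i])
            if neigh.size == 0:
                continue
            weights = {}
            for j in neigh:
                weights[labels[j]] = weights.get(labels[j], 0.0) + sims[i, j]
            labels[i] = max(weights, key=weights.get)
    return labels
```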
For enrollment, a speaker is represented by averaging embeddings. For scoring, similarly to visual system 1, cosine similarity between enrollment and test embeddings is used, and the maximum score is selected.
Calibration and fusion for the VAST task. Calibration and fusion are trained using logistic regression on the development test sets: the SRE 18 VAST eval set was used for calibration and fusion of the audio-only systems, and the SRE 19 audio-visual development set was used for calibration and fusion of the audio-visual systems.
Performance evaluation. In Table 4 we compare different backends on top of the ResNet with additive angular margin softmax architecture. We can see from here that adaptation and score normalization are found helpful. Cosine scoring outperformed the PLDA backend in the VAST audio-only task; perhaps this is due to the fact that there is not much domain shift between the train and test settings in this case.
In Table 5 we show the influence of using diarization on the multi-speaker test recordings for the VAST audio-only task. We can see from here that diarization helps to boost performance.
In this table we present the performance of the audio-only and visual-only single and fused systems, and of the audio-visual fused systems, on the dev and eval test sets. We can see from here that fusion helps to improve performance. The performance of the visual-only systems is not that good, but when the visual modalities are fused with the audio modality, a huge improvement in performance is achieved over the unimodal systems.
Finally, the conclusions. Adaptation of the source domain to the target domain played a vital role for both the CMN2 and VAST tasks, using either fine-tuning of the speaker discriminant neural network toward the target domain, or adaptation techniques such as correlation alignment (CORAL) or feature distribution adaptation (FDA), or adversarial domain adaptation using a standard DANN. Diarization helped to boost performance in the multi-speaker test recording scenario. Simple score-level fusion of audio and face biometrics provided a significant performance improvement over the unimodal systems, which indicates that there exists complementarity between the audio and visual modalities.
Thank you very much for your attention.