Hello everyone. I'm from CRIM, the Computer Research Institute of Montreal, and today I'm going to present our work on the analysis of the ABC submission to the NIST SRE 2019 CMN2 and VAST challenges.

In this talk I'm going to provide an overview of the ABC submission for NIST SRE 2019 by Brno University of Technology (BUT), CRIM Montreal, Phonexia, and Omilia.

This is the outline of my talk. I'm going to start with an introduction and a description of the data. Then I'm going to talk about speaker verification on conversational telephone speech, the CMN2 task. After that I'll talk about the multimedia speaker verification task, VAST, which employs audio and face biometric traits. Finally, I'm going to draw my conclusions.

Introduction. In the 2019 edition of NIST SRE there are two tasks. The first task is speaker verification on conversational telephone speech, where there is a domain mismatch between the training and test settings, mainly due to the difference in languages: the training data are mostly in English, whereas the test data are in Arabic. The second task is multimedia speaker recognition over VAST (Video Annotation for Speech Technologies) data, where the main challenge is the multi-speaker test recordings. There are two sub-tasks in the VAST task: one is the verification of a speaker on audio only, that is, modeled by the audio trait alone, and the other is audio-visual verification, that is, verification of a speaker employing both the audio and face biometric traits.

In this work we present the systems developed by the ABC team to tackle the challenges introduced in both the CMN2 and VAST tasks of NIST SRE 2019, and we also provide some analyses of the results.

Data preparation. The original data used for training the speaker-discriminative neural networks are NIST SRE 2004 to 2010, Fisher English, Switchboard, and VoxCeleb 1 and 2. Augmented data are created by adding noise from MUSAN and room impulse responses from OpenSLR, and also by applying compression codecs such as GSM. Only 500k recordings were selected from the augmented data and added to the original data, both to increase the amount and the variability of the training data. After filtering based on a minimum audio duration, in this case five seconds after VAD, and a minimum number of utterances per speaker, in this case five utterances per speaker, there are approximately seven thousand speakers in the training data.

The data used for backend training are the NIST SRE 2004 to 2010 data, having approximately 66,000 recordings. The adaptation set is based on SRE18: it consists of 60 percent of the SRE18 eval set, in total about a thousand recordings from 137 speakers. Parts of the evaluation and adaptation sets and the SRE18 unlabeled data were used for score normalization, and as development test set we used the remaining 40 percent of the SRE18 eval data.

Feature extraction. As local features we use 40-dimensional filterbank or 23-dimensional MFCC features, extracted over 25-millisecond windows with a frame shift of 10 milliseconds. For feature normalization, short-term cepstral mean normalization is used with a sliding window of three seconds, and non-speech frames are removed using energy-based voice activity detection.
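As an illustration, sliding-window cepstral mean normalization can be sketched as follows. This is a minimal numpy sketch of my own, assuming features arrive as a frames-by-coefficients matrix; a 300-frame window corresponds to three seconds at a 10 ms frame shift.

```python
import numpy as np

def sliding_cmn(feats, window=300):
    """Subtract, from each frame, the mean over a window centered on it."""
    n = len(feats)
    half = window // 2
    out = np.empty_like(feats, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```

Any offset that is constant over the window, such as a stationary channel effect, is removed; a constant feature stream maps to all zeros.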

The general pipeline that has been adopted for speaker verification on the CMN2 task is as follows. The current trend in speaker verification is to use a speaker embedding front-end with a PLDA backend, where the speaker embeddings are extracted using a speaker-discriminative neural network. Such a network is normally trained to discriminate among a set of training speakers and is supervised by some variant of a classification loss, such as softmax, or a metric-learning loss function. In this case, for the CMN2 task, we use four speaker-discriminative neural networks trained with four different architectures. As backend we use either Gaussian PLDA or heavy-tailed PLDA. Evaluation embeddings are centered using the mean of the adaptation set, whereas the backend training set embeddings are centered using the mean of that same training set. Training embeddings are then adapted to the target domain using feature distribution adaptation (FDA), and finally we use Kaldi's unsupervised PLDA adaptation applied to the PLDA model, which is trained on the unadapted speaker embeddings. Adaptive score normalization (AS-norm) is applied on top of this pipeline for each individual system in the CMN2 task.
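The score normalization step can be sketched as follows. This is a simplified plain S-norm in numpy, my own illustration; the adaptive variant used in practice would additionally restrict each cohort to its top-scoring files, which I omit here.

```python
import numpy as np

def s_norm(score, enroll_cohort, test_cohort):
    """Symmetric score normalization.

    enroll_cohort: scores of the enrollment model against a cohort set.
    test_cohort:   scores of the test utterance against the same cohort.
    The raw trial score is z-normalized against both and averaged.
    """
    zn = (score - enroll_cohort.mean()) / enroll_cohort.std()
    tn = (score - test_cohort.mean()) / test_cohort.std()
    return 0.5 * (zn + tn)
```

Normalizing each trial score against per-model and per-utterance cohort statistics makes scores comparable across trials, which helps a single calibration threshold work for everyone.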

System 1 uses a standard 50-layer ResNet architecture for training the speaker-discriminative neural network, with filterbank features, and Gaussian PLDA is used for the scoring. Two-dimensional convolution is applied over the filterbank features, and statistics pooling is used to obtain a global representation from the local frame-level features. In this system, for training the PLDA model, additional training data are used from the SRE 2006 to 2010 evaluation data, which contain ten thousand recordings from around two thousand speakers. The extracted embeddings are post-processed as described previously in the general pipeline.
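The statistics-pooling layer mentioned above simply concatenates the mean and standard deviation of the frame-level features over time; a minimal sketch:

```python
import numpy as np

def stats_pooling(frame_feats):
    """Collapse a (num_frames, dim) matrix into a fixed 2*dim vector
    by concatenating the per-dimension mean and standard deviation."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])
```

This is the step that turns a variable-length utterance into a fixed-size representation, independent of duration.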

System 2 employs a factorized TDNN (F-TDNN) architecture for training the speaker-discriminative neural network. The Kaldi SRE16 recipe was used in this case, and the network was trained for six epochs. As backend, heavy-tailed PLDA is used, following the general pipeline that has been mentioned before.

For System 3, the architecture selected to train the speaker-discriminative neural network is an extended TDNN (E-TDNN) architecture with a few residual connections between its layers, and the network was trained for two epochs. In this case the extracted embeddings are 768-dimensional instead of 512, and the embeddings are denoised using a denoising autoencoder. One-dimensional convolution is applied over the MFCC features, and again statistics pooling is used for generating a global utterance-level representation. Heavy-tailed PLDA is used as backend, following the general pipeline that has been mentioned before.

Finally, in System 4, a TDNN architecture similar to the one used in System 2 is employed for training the speaker-discriminative neural network, but this network was trained only on the SRE 2004 to 2010 English data, and in this case MFCC features are used as the frame-level features. An adversarial domain classification network is used on top of the embedding network, mainly to discriminate between the source and target domains; the source domain here is English and the target domain is Arabic. The extracted embeddings in this case are 768-dimensional, and as backend heavy-tailed PLDA is used, following the general pipeline mentioned before.

Calibration and fusion. For the CMN2 task, calibration and fusion were trained with logistic regression on the development set. Consistent performance was observed across the progress and eval sets, which indicates that we achieved almost perfect calibration.
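Calibration and fusion by logistic regression can be sketched as follows. This is a minimal numpy sketch of my own with plain gradient descent; in practice a prior-weighted objective and an off-the-shelf optimizer would be used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_fusion(scores, labels, lr=0.5, iters=1000):
    """Learn per-system weights and an offset so that
    sigmoid(scores @ w + b) matches the 0/1 target labels.

    scores: (num_trials, num_systems) raw scores on a dev set
    labels: (num_trials,) 1 for target trials, 0 for non-targets
    """
    n, k = scores.shape
    w, b = np.zeros(k), 0.0
    for _ in range(iters):
        p = sigmoid(scores @ w + b)
        grad = p - labels                  # gradient of cross-entropy w.r.t. logits
        w -= lr * scores.T @ grad / n
        b -= lr * grad.mean()
    return w, b
```

The fused, calibrated score for a new trial is then `scores @ w + b`, interpretable as a log-likelihood ratio when the training prior is accounted for.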

Table 1 presents the results of the individual and fused systems on the dev and eval sets for the CMN2 task. The single best system here was System 1, the ResNet with the Gaussian PLDA backend. The embedding denoising did not help on its own, but when fused with the other systems it resulted in a nice improvement in performance. The full fusion provided the best performance in this case.

In Table 2 we present and compare system performance using different backends with the ResNet, E-TDNN, and F-TDNN architectures. For the CMN2 task, the heavy-tailed PLDA backends clearly win; this is perhaps due to the domain mismatch between the training and test settings.

In Table 3 we show the performance when various post-processing is applied to the extracted speaker embeddings. From this table we can see that applying mean centering, feature distribution adaptation, Kaldi PLDA adaptation, and AS-norm in combination to the extracted embeddings leads to the best performance.

Finally, the VAST task. Data preparation: the original data used in this case for training the speaker-discriminative neural networks are mainly the VoxCeleb 2 development data, which contain about six thousand speakers; but for the TDNN system, VoxCeleb 1 and 2 and LibriSpeech combined, which consist of around eleven thousand speakers, are used for training. Augmented data are created by adding noise from MUSAN and room impulse responses from OpenSLR, and a subset of the augmented recordings is selected and added to the original data in order to increase the amount and variability of the training data. After filtering based on a minimum audio duration, in this case four seconds after voice activity detection, and a minimum number of utterances per speaker, in this case eight utterances per speaker, there are approximately six thousand speakers in the training data.

The data used for backend training are around 145 thousand utterances from the original training data. The adaptation set is based on thirty-seven utterances from the SRE18 VAST development data, and a subset of the PLDA training data is used for score normalization with S-norm. The development test set chosen for the audio-only sub-task is the SRE18 VAST eval set, whereas for the audio-visual task the development test set is the SRE19 audio-visual development set.

Feature extraction. For the VAST task, as local features we use 40-dimensional filterbank or 23-dimensional PLP features, extracted with a 25-millisecond window and a frame shift of 10 milliseconds. For feature normalization we use short-term cepstral mean normalization with a sliding window of two seconds, and non-speech frames are removed using an energy-based voice activity detector.
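The energy-based voice activity detection step can be illustrated with a simplified sketch of my own (not the exact detector used in the submission): frames whose log energy falls more than a fixed margin below the loudest frame are dropped.

```python
import numpy as np

def energy_vad(signal, frame_len=200, margin_db=30.0):
    """Return a boolean speech/non-speech mask, one entry per frame.

    A frame is kept if its log energy is within margin_db dB of the
    most energetic frame in the recording.
    """
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    log_e = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return log_e > log_e.max() - margin_db
```

Relative thresholding against the loudest frame makes the detector insensitive to the overall recording gain.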

For the audio-only sub-task of the VAST challenge, the general pipeline is as follows. We use speaker-discriminative neural networks trained with three different architectures in order to extract the speaker embeddings. As backend we use Gaussian PLDA or cosine scoring. Enrollment and test embeddings are centered using the mean of the backend training set, and training embeddings are adapted to the target domain using feature distribution adaptation. Diarization is applied on the test set, and the final score is the maximum over the diarization outputs; the score is then normalized using S-norm.

Turning to the individual systems and fusion: for the VAST audio-only sub-task we have three single systems.

System 1 uses a standard ResNet architecture, which is first pretrained using the softmax loss and then fine-tuned using the additive angular margin loss function. In this case filterbank is used as local features, and as backend Gaussian PLDA and cosine scoring are used; for post-processing we follow the general pipeline that has been mentioned before.
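The additive angular margin softmax used for the fine-tuning modifies the target-class logit from s*cos(theta) to s*cos(theta + m). A minimal numpy sketch of the logit computation; the scale and margin values here are illustrative, not the ones used in the submission:

```python
import numpy as np

def aam_logits(emb, weights, labels, s=30.0, m=0.2):
    """Cosine logits with an additive angular margin on the true class.

    emb:     (batch, dim) speaker embeddings
    weights: (dim, num_speakers) class weight vectors
    labels:  (batch,) true speaker indices
    """
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = np.clip(e @ w, -1.0, 1.0)
    logits = s * cos
    rows = np.arange(len(labels))
    theta = np.arccos(cos[rows, labels])
    logits[rows, labels] = s * np.cos(theta + m)   # penalize the true class
    return logits
```

Because the true-class logit is shrunk by the margin, the network must pull embeddings closer to their class direction than plain softmax requires, which tightens the speaker clusters.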

System 2 in this case uses a TDNN architecture for training the speaker-discriminative neural network, and this network is trained using the Kaldi SRE16 recipe over VoxCeleb 1 and 2 and LibriSpeech data for six epochs. As backend, a Gaussian PLDA model is used, following the general pipeline that has been mentioned before.

System 3 is trained following the Kaldi x-vector recipe on the SRE 2004 to 2010 and Switchboard data for two epochs. As front-end features, PLP is used. Augmented SRE 2004 to 2010 data were used for training the backend model, and correlation alignment (CORAL) based domain adaptation is used for adapting the source domain to the target domain in this case. As backend, Gaussian PLDA is used, and for System 3 no score normalization was applied.
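Correlation alignment recolors the source-domain features so that their second-order statistics match the target domain; a minimal numpy sketch of Sun et al.'s CORAL transform, as I understand it (the exact variant used in the submission may differ):

```python
import numpy as np

def matrix_sqrt(C):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(np.sqrt(np.maximum(vals, 0.0))) @ vecs.T

def coral(Xs, Xt, eps=1e-6):
    """Whiten the source data with its own covariance, then recolor
    it with the target-domain covariance."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)
    whiten = np.linalg.inv(matrix_sqrt(Cs))
    return (Xs - Xs.mean(axis=0)) @ whiten @ matrix_sqrt(Ct)
```

After the transform, the covariance of the adapted source data matches the target covariance, so a backend trained on it is better matched to target-domain trials.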

Since the test data contain multi-speaker recordings, we adopt speaker diarization to obtain the number of speakers and to group the speech segments according to speaker identity. For each test utterance we extract an x-vector every 250 milliseconds; then agglomerative hierarchical clustering is used to cluster the embeddings into one, two, three, or four speaker clusters, and an embedding is then extracted for each speaker in the test recording. The enrollment embedding is scored against all test embeddings, and finally the maximum score over them is taken.
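The diarize-then-take-the-maximum step can be sketched as follows. This is a toy average-linkage agglomerative clustering on cosine distance of my own; the actual system extracts x-vectors every 250 ms and tries one to four clusters, and scores cluster-level embeddings with its PLDA or cosine backend.

```python
import numpy as np

def cos_dist(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def ahc(X, num_clusters):
    """Average-linkage agglomerative clustering of the row vectors in X."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > num_clusters:
        best, best_d = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.mean([cos_dist(X[a], X[b])
                             for a in clusters[i] for b in clusters[j]])
                if d < best_d:
                    best, best_d = (i, j), d
        i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

def max_cluster_score(enroll, X, clusters):
    """Score enrollment against each cluster mean; keep the maximum."""
    return max(1.0 - cos_dist(enroll, X[c].mean(axis=0)) for c in clusters)
```

Taking the maximum over clusters reflects the task definition: the target speaker only needs to be one of the speakers present in the test recording.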

Let me now turn to the visual systems for the audio-visual sub-task. Visual System 1 uses a pretrained squeeze-and-excitation ResNet-50, trained on the VGGFace2 dataset, and this pretrained network is used for extracting the face embeddings. For the enrollment data, based on the provided frame indices and face bounding boxes, the corresponding image regions are cropped and normalized before being passed to the pretrained model for embedding extraction, and a speaker is represented by averaging the enrollment embeddings. For the test data, a single-shot-detector-based face detection tool is used to detect faces once per second. For scoring, cosine similarity is computed between the enrollment and test face embeddings, and the maximum score is selected. No score normalization is applied for any of the visual systems.

Visual System 2 is similar to Visual System 1: it also uses a pretrained squeeze-and-excitation ResNet trained on the VGGFace2 dataset to extract the face embeddings. But for this system, at each frame multiple bounding boxes are extracted using MTCNN, and Kalman filtering is applied to track the extracted bounding boxes from frame to frame. The Chinese Whispers algorithm is applied for clustering, and this algorithm does not use any prior information about the number of clusters. For enrollment, a speaker is again represented by averaging the embeddings, and for scoring, similar to Visual System 1, cosine similarity between embeddings is used and the maximum score is selected.

Calibration and fusion for the VAST task are trained via logistic regression on the development test sets: the SRE18 VAST eval set was used for calibration and fusion for the audio-only sub-task, and the SRE19 audio-visual development set was used for calibration and fusion of the audio-visual systems.

Performance evaluation. In Table 4 we compare different backends on top of the ResNet with additive angular margin softmax architecture. We can see from here that adaptation and score normalization are found helpful, and that cosine scoring outperformed the PLDA backend in the VAST audio-only task; perhaps this is due to the fact that there is not much domain shift between the training and test settings in this case.

In Table 5 we show the influence of using diarization on the multi-speaker test recordings for the VAST audio-only task. We can see from here that diarization helps to boost performance.

In the next table we present the performance of the audio-only and visual-only single and fused systems, and of the audio-visual fused systems, on the dev and eval test sets. We can see from here that fusion helps to improve performance. The performance of the visual-only systems is not that good, but when the visual modality is fused with the audio modality, a huge improvement in performance is achieved over the audio-only systems.

Finally, the conclusions. Adaptation of the source domain to the target domain played a vital role for both the CMN2 and VAST tasks, whether through fine-tuning of the speaker-discriminative neural network to the target domain, adaptation with correlation alignment or feature distribution adaptation, or score-domain adaptation using AS-norm. Diarization helped to boost performance in the multi-speaker test recording scenario. Simple score-level fusion of the audio and face biometrics provided a significant performance improvement over the audio-only systems, which indicates that there exists complementarity between the audio and visual modalities. Thank you very much for your attention.