Hello everyone. We are from Johns Hopkins University.

This presentation of our framework is on speaker verification and speech enhancement.

In this presentation I will analyze a system which does enhancement for speaker verification.

I will be reusing some slides from my previous work, which was called "Feature Enhancement with Deep Feature Losses for Speaker Verification". The downstream task is speaker verification.

The problem refers to the task of determining whether the speaker in utterance one, the enrollment utterance, is the same as in utterance two, which is the test utterance.

The state-of-the-art way to implement this is to use a so-called x-vector extractor network and probabilistic linear discriminant analysis (PLDA), in conjunction with data augmentation.

Speech enhancement enters this problem when we do speaker verification with an enhancement pre-processing applied to the enrollment and test utterances at test time.

It has been noted that speech enhancement may not help when trained independently of the speaker recognition objective. We therefore pursue deep feature loss training, which connects the two problems, as we will see next.

This is the schematic of deep feature loss training. As you can see, there are two networks: one is the enhancement network, and the other is the auxiliary network.

The enhancement network takes noisy features and produces enhanced features.

These enhanced features are not directly compared with the clean features; instead, both are forwarded through the auxiliary network, and the differences between the resulting intermediate activations are known as the deep feature loss.

When we do not use the auxiliary network on the clean and enhanced features, and simply compare the enhanced features with the clean features to form a score, that is the plain feature loss.

As you can imagine, this type of training is doing enhancement; however, speaker information is also retained, since it is injected implicitly through the auxiliary network.
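To make this concrete, here is a minimal sketch of how such a deep feature loss could be computed; the toy auxiliary network, the layer sizes, and the L1 distance are illustrative assumptions, not our exact configuration.

```python
import torch
import torch.nn as nn

class AuxiliaryNet(nn.Module):
    """Toy stand-in for the frozen speaker classification network."""
    def __init__(self, feat_dim=40, hidden=64, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(feat_dim if i == 0 else hidden, hidden, 3, padding=1)
             for i in range(n_layers)])

    def forward(self, x):                # x: (batch, feat_dim, time)
        acts = []
        for layer in self.layers:
            x = torch.relu(layer(x))
            acts.append(x)               # keep every intermediate activation
        return acts

def deep_feature_loss(aux, enhanced, clean, n_loss_layers=5):
    """Sum of L1 distances between the auxiliary activations of the
    enhanced and clean features over the first n_loss_layers layers."""
    with torch.no_grad():
        clean_acts = aux(clean)          # targets: no gradients needed
    enh_acts = aux(enhanced)             # gradients flow back to the enhancer
    return sum(torch.mean(torch.abs(e - c))
               for e, c in zip(enh_acts[:n_loss_layers],
                               clean_acts[:n_loss_layers]))
```

In training this would be called as `loss = deep_feature_loss(aux, enhancer(noisy), clean)`, with the auxiliary network frozen and only the enhancer updated.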

This is what our speaker verification pipeline looks like: the enrollment and test utterances go through feature extraction independently and are also enhanced independently. Each of them then goes through an embedding extractor, which in our case is an x-vector network, and the PLDA classifier produces a log-likelihood ratio that says whether it is the same speaker or not.
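As a rough sketch, the test-time flow could look like the following; `enhance` and `extract_embedding` are placeholders for the trained networks, and cosine similarity stands in for the PLDA log-likelihood ratio, which is more involved than shown here.

```python
import numpy as np

def verify(enroll_feats, test_feats, enhance, extract_embedding, threshold=0.0):
    """Verification sketch: each side is enhanced and embedded
    independently, then a single score decides same speaker or not."""
    e = extract_embedding(enhance(enroll_feats))
    t = extract_embedding(enhance(test_feats))
    # cosine similarity as a simple stand-in for the PLDA LLR
    score = float(np.dot(e, t) / (np.linalg.norm(e) * np.linalg.norm(t)))
    return score, score > threshold      # True -> same speaker
```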

Now for the details of how the database is constructed. We use the MUSAN corpus, which consists of three noise classes: general noises, music, and babble.

These noise classes are combined with VoxCeleb2, which contains 16 kHz conversational speech; each utterance is corrupted with randomly selected noise files at randomly sampled SNRs.

We also use an SNR-based filtering algorithm called WADA-SNR to create a fifty percent subset of VoxCeleb2; it is supposed to preserve the highest-SNR utterances from the corpus.

This clean version of VoxCeleb2 is then combined with the MUSAN noises, and that serves as the noisy counterpart for our supervised enhancement training.

The x-vector extractor and the PLDA, meanwhile, are trained with the usual augmented VoxCeleb data; those are the pre-trained networks that we use.
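For illustration, the data creation boils down to two operations, sketched below under simplifying assumptions; the WADA-SNR estimator itself is not reimplemented, only the top-SNR selection it enables.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR, then add."""
    if len(noise) < len(speech):                   # loop the noise if short
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def top_snr_subset(utt_ids, snr_estimates, keep_frac=0.5):
    """Keep the highest-estimated-SNR half of a corpus (the role the
    WADA-SNR scores play for us)."""
    order = np.argsort(snr_estimates)[::-1]        # descending SNR
    n_keep = int(len(utt_ids) * keep_frac)
    return [utt_ids[i] for i in order[:n_keep]]
```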

To give more details: the features used are 40-dimensional log Mel filterbanks, unless stated otherwise.

The evaluation is done on BabyTrain, which is a corpus containing young children's speech recorded in uncontrolled environments. The complete data is 250 hours, and it is divided into a detection and a diarization task.

We will not explain the diarization component of our pipeline here.

For the evaluation data, the numbers of speakers in enroll and test are 595 and 150, respectively.

Results are presented in the form of equal error rate (EER) and minimum decision cost function (minDCF) with a target prior probability of five percent.
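For reference, both metrics can be computed from the trial scores by sweeping a decision threshold; here is a minimal numpy sketch, assuming equal miss and false-alarm costs, which may differ from the official scoring tool.

```python
import numpy as np

def eer_and_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """EER and minDCF from trial scores; labels are 1 for target
    (same-speaker) trials and 0 for non-target trials."""
    order = np.argsort(scores)
    labels = np.asarray(labels, dtype=float)[order]
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    p_miss = np.cumsum(labels) / n_tar             # targets at/below threshold
    p_fa = 1.0 - np.cumsum(1 - labels) / n_non     # non-targets above it
    i = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[i] + p_fa[i]) / 2
    dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    # normalize by the cost of the best trivial (accept/reject-all) system
    return eer, dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
```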

The table that you see here is from our previous work, which we want to analyze further in this work. If you focus on the second dataset column, which is about BabyTrain, you can see that the first row is without enhancement, and 'ori' refers to the original version of the x-vector network.

More generally, these labels are just notation to denote the type of PLDA and enhancement data used.

So the first row gives the result without enhancement, which is 7.6% EER. Then we have the feature loss, the deep feature loss, and also their combination, and the deep feature loss usually gave the best performance previously.

The final row gives the comparison of how much performance we gain, and you can see that our deep feature loss enhancement performs best.

Having said that, we want to address seven questions. First: are only the initial layers of the auxiliary network useful for deep feature loss training, and can the feature loss be additive with the deep feature loss?

Second: for supervised enhancement training, how clean does the training data need to be, and can we just use any clean speech corpus, or will that create database mismatch issues?

Third: the x-vector and auxiliary networks available to us are pre-trained on low-dimensional features; can we train an enhancement network that works on higher-dimensional features and still get some benefit? Fourth: if we retrain the x-vector network and PLDA on enhanced data, does that training bring improvements?

Fifth: can we use the enhanced features to bootstrap the training data, doubling the amount of data, and make our extractor more robust?

Sixth: we check whether the noise classes we are working with are really useful during the data creation process, or whether some noise class is even harmful.

And the final question is whether the proposed scheme works for the task of dereverberation, and for joint denoising and dereverberation; we produce a baseline and see whether deep feature loss training is good there as well.

Here is a results table with a lot of numbers, but for this presentation it is enough to focus on the first column, which gives you the labels for the type of loss or data used, and on the final column, which is the mean result on the BabyTrain test set.

The first row shows that, without enhancement, we get 9.1% EER.

Then we have DFL-5, where the deep feature loss is extracted from five layers. Our auxiliary network has six layers; the first five are used in this row, and the sixth is the final classification layer, which we are not using for this particular loss.

It gives the best performance among the combinations you see. FL is the plain feature loss, and it gives worse performance than even the baseline.

This reproduces observations from our previous work. Combining the two losses is also not good: 9.2%.

Including the embedding layer, that is, the last layer's loss, in the deep feature loss is also not helpful.

Then we use the deep feature loss with five, four, three, two, and finally one layer, and the reduced-layer versions are not as good as using all the layers.

The bottom half of the table reports the minimum decision cost function; the observations are mostly the same as for the equal error rate.

So far we have seen that the plain feature loss is inferior and hurts our system, and that combining it with the deep feature loss is also not useful. Using more layers does increase the computational complexity, but that is acceptable: the main takeaway is that you need to use the feature losses from all the hidden layers of the auxiliary network.

Next, we look at the choice of training dataset for the enhancement and auxiliary networks.

Looking at the rows, the one where VoxCeleb2 filtered with the WADA-SNR algorithm was used for the enhancement network, and as a consequence for the auxiliary network, gives the best performance, shown in boldface.

Using the unfiltered VoxCeleb2, where we just have the whole corpus combined with the noise augmentations, is worse.

We also tried a random fifty percent subset of VoxCeleb2 for the enhancement network, and it is not as good as the WADA-SNR filtering; this shows that screening VoxCeleb2 for high SNR seems to be important.

We also used Fisher speech, and you can see it gives an EER even worse than the baseline. It seems that conversational, mismatched data hurts the training even when used only as the clean counterpart for the enhancement network.

We also suspect that part of why the filtered VoxCeleb2 works best is that more data is used and that data is better conditioned.

Next, we see what happens if we mismatch the features: can the enhancement network use higher-dimensional features than the x-vector network?

The first row's label is LFB40, which means 40-dimensional log Mel filterbank features for the enhancement network. Recall that 40-dimensional features are used in the x-vector network and the auxiliary network, so this is the condition where the features are matched, and we do not need to learn any bridge between the networks. This case works best.

With higher-dimensional features, an 80-dimensional filterbank and the magnitude spectrogram, the features are mismatched and you need a learned bridge between the networks as well. The results are not as good as in the matched condition; it seems we cannot take advantage of high-dimensional features here.
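Such a bridge can be as simple as a learned linear projection; the sketch below is hypothetical, with the dimensions assumed for illustration.

```python
import torch.nn as nn

class FeatureBridge(nn.Module):
    """Learned linear map from the enhancement network's feature space
    (e.g. an 80-dim filterbank or a magnitude spectrogram) down to the
    40-dim log Mel space the frozen auxiliary network expects."""
    def __init__(self, in_dim=80, out_dim=40):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feats):            # feats: (batch, time, in_dim)
        return self.proj(feats)
```

The bridge would be trained jointly with the enhancer so that the deep feature loss can still be computed in the auxiliary network's own input space.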

Note that the magnitude spectrogram is what is commonly used for enhancement, at least in the literature, but here it is also worse than the baseline.

Next, we see the effect of enhancement on the PLDA and x-vector extractor training data. The first row, where only the test data is enhanced, is the reference. When 'PLDA' is written in the label, it means the PLDA training data is enhanced as well, and you can see the EER degrades.

For the minDCF there is not much change, so we feel the PLDA is not really benefiting; it is quite susceptible to the enhancement processing.

If we enhance the x-vector training set instead, there is improvement over the 'ori' baseline, which is the unenhanced system; however, it is not as good as just enhancing the test data. And when we enhance both, it seems like the robustness of the whole system is lost, so this is not working, at least for this corpus.

Next, we combine the enhanced features with the original ones to see if we can take advantage of them being complementary. The 'ori+enh' notation simply means that we pool the original and enhanced versions of all the data for the component named in the column.

You can see 'ori+enh' for the PLDA, which means the pooled data is used in the PLDA training; including the original features alongside the enhanced ones there does not seem to help.

But when we combine these features in the x-vector training set, it actually gives much better performance. It seems like the network benefits from the doubled data, and there is also complementary information in the enhanced features, so the two can be combined.

If we additionally include these pooled features in the PLDA training, it does not help, which confirms that the PLDA is susceptible to the enhancement processing. So it is best to just put the enhanced features in the x-vector training set, not in the PLDA training.
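The pooling itself is straightforward; a small illustrative sketch, with names as placeholders:

```python
def pool_ori_and_enh(original_feats, enhance):
    """Double the x-vector training set with enhanced copies ('ori+enh').
    original_feats maps utterance id -> feature matrix; enhance is the
    trained enhancement network applied as a function."""
    pooled = {}
    for utt_id, feats in original_feats.items():
        pooled[utt_id] = feats                     # original view
        pooled[utt_id + "-enh"] = enhance(feats)   # enhanced copy
    return pooled
```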

Now we see what happens if we remove one type of noise class from the x-vector network and enhancement training data. Let's focus on the row of this table which is about music.

In the last column we have 9.05%, which means that when we skip the music files in the x-vector training data, and we also do not use enhancement, we actually do better than the baseline. So removing music is good; this noise class actually hurts performance.

Next, 'unseen' means we use enhancement, but the enhancement network has not seen music during training. It is still able to improve the result, which is somewhat surprising.

Most interestingly, 'seen', where we use an enhancement network which has seen music, is the best.

So it seems like some noise classes are harmful when included in the x-vector training data, but it is fine to include them in the enhancement training data.

Next, to see if we can do dereverberation with the deep feature loss, we tried several schemes: a pure dereverberation scheme, and schemes doing dereverberation and denoising together, either in a joint fashion, denoted 'joint one-stage', or in a two-stage fashion.
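Schematically, the difference between the schemes is just how the mappings compose; a toy sketch with hypothetical function names:

```python
def two_stage(reverb_noisy_feats, dereverb_net, denoise_net):
    """Two-stage: dereverberate first, then denoise, with each network
    trained separately for its own sub-task."""
    return denoise_net(dereverb_net(reverb_noisy_feats))

def joint_one_stage(reverb_noisy_feats, joint_net):
    """Joint one-stage: a single network trained to undo reverberation
    and additive noise at once."""
    return joint_net(reverb_noisy_feats)
```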

If we view all these numbers, we see that dereverberation is not actually working. It is possible that our setup is simply not a suitable configuration for it; nevertheless, it seems you would need a dedicated pre-processing step to improve on this.

Finally, the takeaways: you need to use all the usable layers of the auxiliary network for this type of training, and you should use WADA-SNR-based filtering to keep the highest-SNR utterances when constructing the clean data for enhancement network training.

Mismatching the features between the enhancement and auxiliary networks is slightly worse; it is better to use the same features.

We see that the PLDA is not really robust here: it is very susceptible to enhanced data, so we cannot put such data in its training set, while the x-vector extractor, fortunately, can use it.

Some noise types, like music, are harmful in the x-vector training data.

And finally, dereverberation does not work with this type of training scheme.

That is the end of the presentation. Please feel free to send us your questions. Thank you.