Hello everyone. We are from Johns Hopkins University. Our presentation is on speaker verification and speech enhancement.
Let's start with the slides. In this presentation, we analyze a system which uses speech enhancement for speaker verification.
I will be using some slides from our previous work, which was called "Feature Enhancement with Deep Feature Losses for Speaker Verification."
Our downstream task is speaker verification. The problem refers to the task of determining whether the speaker in utterance one, which is the enrollment utterance, is the same as the speaker in utterance two, which is the test utterance.
The state-of-the-art way to implement this is to use a so-called x-vector extractor network together with probabilistic linear discriminant analysis (PLDA), usually in conjunction with data augmentation.
Speech enhancement can help this problem: we can help speaker verification by preprocessing the enrollment and test utterances at test time. However, it has been noted that speech enhancement often only helps when trained with the speaker recognition objective in mind.
We therefore pursue a type of training called deep feature loss training, which connects the two problems, as we will see now.
This is the schematic of deep feature loss training. As you can see, there are two networks: one is the enhancement network, and the other is the auxiliary network.
The enhancement network takes noisy features and produces enhanced features.
These enhanced features are not directly compared with the clean features. Instead, both are forwarded through the auxiliary network, and the differences in the intermediate activations are computed; this is known as the deep feature loss. When we do not use the auxiliary network and simply compare the enhanced features with the clean features, we call it the feature loss.
You can imagine that this type of training performs enhancement while also preserving information useful to the downstream task, since the auxiliary network is also used.
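To make the two losses concrete, here is a minimal PyTorch-style sketch; the L1 distance, the equal per-layer weighting, and all names are illustrative assumptions rather than our exact recipe:

import torch

def feature_loss(enhanced, clean):
    # Plain feature loss: compare enhanced and clean features directly.
    return torch.mean(torch.abs(enhanced - clean))

def deep_feature_loss(enhanced, clean, aux_layers):
    # Deep feature loss: forward both through the frozen auxiliary network
    # and accumulate distances between intermediate activations.
    with torch.no_grad():                 # clean targets need no gradient
        targets, x_c = [], clean
        for layer in aux_layers:          # e.g. the first 5 of 6 layers
            x_c = layer(x_c)
            targets.append(x_c)
    loss, x_e = 0.0, enhanced
    for layer, target in zip(aux_layers, targets):
        x_e = layer(x_e)
        loss = loss + torch.mean(torch.abs(x_e - target))
    return loss

Only the enhanced path carries gradients back to the enhancement network; the auxiliary network stays frozen.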
This is how our speaker verification pipeline looks: the enrollment and test utterances go through feature extraction independently and are also enhanced independently. Then each of them goes through our embedding extractor, which in our case is an x-vector network, and the PLDA classifier computes a log-likelihood ratio to decide whether they come from the same speaker or not.
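For reference, the PLDA score is the log-likelihood ratio between the same-speaker and different-speaker hypotheses; in standard notation (the exact backend may differ in details),

    s(x_e, x_t) = log [ p(x_e, x_t | H_same) / ( p(x_e | H_diff) * p(x_t | H_diff) ) ]

where x_e and x_t are the enrollment and test x-vectors, and the score is compared against a threshold.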
Now, here are the details of how the data preparation is done. We use the MUSAN corpus, which consists of three noise classes. These noise classes are combined with VoxCeleb, which contains 16 kHz conversational speech. We denote the result as VoxCeleb-combined, and it is three times the size of VoxCeleb. The augmentation works by randomly sampling the noises and SNRs for the utterances.
We also use an SNR-based filtering algorithm called WADA-SNR to create a fifty-percent subset of VoxCeleb; it is supposed to preserve the highest-SNR utterances. This clean version of VoxCeleb is then combined with the MUSAN noises, and that serves as the noisy counterpart for our supervised enhancement training.
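A minimal sketch of this data construction, with a generic SNR estimator snr_of() standing in for the actual WADA-SNR implementation:

import numpy as np

def top_half_by_snr(utterances, snr_of):
    # Keep the 50% of VoxCeleb utterances with the highest estimated SNR,
    # which is the role WADA-SNR filtering plays in our pipeline.
    ranked = sorted(utterances, key=snr_of, reverse=True)
    return ranked[: len(ranked) // 2]

def mix_at_snr(clean, noise, snr_db):
    # Add a MUSAN noise to a clean waveform at the requested SNR in dB.
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise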
We train the PLDA with VoxCeleb-combined, and the x-vector network uses the same data.
To give more details, the features we use are 40-dimensional log Mel filterbank features.
The evaluation is done on BabyTrain, which is a corpus containing young children's speech recorded in an uncontrolled environment. The complete data is around 250 hours and is divided into detection and diarization tasks. We will not explain the diarization component of our pipeline here. For the evaluation data, the number of speakers in enrollment and test is 595 and 150, respectively.
Results are presented in the form of equal error rate (EER) and minimum decision cost function (minDCF) with a target prior probability of five percent.
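As a reminder, the detection cost at an operating threshold t is the usual weighted combination of miss and false-alarm rates (equal costs assumed here, which is a common convention),

    DCF(t) = C_miss * P_miss(t) * P_tar + C_fa * P_fa(t) * (1 - P_tar),  with P_tar = 0.05

and minDCF is the minimum of the normalized DCF over all thresholds.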
The table that you see here is from our previous work, which we want to analyze further in this work. If you focus on the second dataset column, which is BabyTrain, you can see that the first row is actually without enhancement and refers to the original version of our system, and E(.) is just a notation to denote the type of enhanced data used. So the first row gives the result without enhancement, which is 7.6% EER. Then we use the feature loss, the deep feature loss, and also their combination, and we saw that the deep feature loss usually gave the best performance. The last rows give the comparison of how much performance we gain, and again the deep feature loss performs best.
Having said that, we want to address several questions. First, are only the initial layers of the auxiliary network useful for deep feature loss training, and is the feature loss additive with the deep feature loss? Second, for supervised enhancement training, how clean does the training data need to be, and can we just use read speech from a separately curated database? These are mismatch issues. Third, the x-vector extractor and the auxiliary network are currently available pre-trained on low-dimensional features; can higher-dimensional features be used to train the enhancement network, and can we get some benefit from them? Fourth, can we re-train the PLDA and the x-vector network on enhanced data and obtain improvements? Fifth, can enhanced features be used to bootstrap the training data, doubling the amount of data, and make our extractor more robust? Sixth, we want to see whether the noise classes we are working with are really useful during the data augmentation process, or whether some of the noise classes are even harmful. The final question is whether the proposed scheme works for the task of dereverberation and joint denoising and dereverberation.
First, we reproduce the baseline and see which layers are good for deep feature loss extraction.
This is a results table with a lot of numbers, but for this presentation it is enough to focus on the first column, which gives the labels for the type of loss or data being used, and the final column, which is the main result on the BabyTrain test set.
The first row shows that without enhancement we get 7.9% EER. Then we have DFL-L5, where the deep feature loss is extracted from five layers. The auxiliary network has six layers; the first five are used in this loss, and the sixth is the classification layer, which we are not using for this particular loss.
It gives the best performance among the combinations you see. FL is the plain feature loss, and it gives worse performance than even the baseline.
This reproduces the observations from our previous work. Combining the two losses is also not good. Including the embedding layer, that is, the last layer of the auxiliary network, in the deep feature loss is also not helpful. Then we use the deep feature loss with four layers, three layers, two layers, and finally one layer, and they are not as good as using all five layers.
The bottom half of the table is the minimum decision cost function, and the observations are mostly the same as for the equal error rate.
So far we have seen that the plain feature loss is inferior in our system, and combining it with the deep feature loss is also not helpful. Using more layers is best; it increases the computational complexity, but that's okay. The main takeaway is that you need to use all of the initial layers of the auxiliary network.
Next, we look at the choice of the training dataset for the enhancement network. In row number two, VoxCeleb-filtered means that VoxCeleb alone, with the WADA-SNR filtering, was used as the clean data for the enhancement network, and it gives the best performance, marked in boldface.
Using VC-combined, which is VoxCeleb combined with the noise augmentations, in place of the filtered VoxCeleb is worse. We also try a random fifty-percent subset of VoxCeleb for the enhancement network, and it is not as good as the WADA-SNR filtering, which suggests that carefully screening VoxCeleb for SNR is important.
We also use LibriSpeech, and you can see it is worse than the baseline. So read speech, being non-conversational and mismatched data, is bad for training, even when used as the clean counterpart for the enhancement network.
We also think that part of the power of training on VoxCeleb is that more data is used and the data condition is matched.
Next, we see what happens if we mismatch the features: can the enhancement network use different, higher-dimensional features? The row labeled FB40 means 40-dimensional log Mel filterbank features for the enhancement network. Recall that 40-dimensional features are used in the auxiliary and x-vector networks, so this is the condition where the features are matched, and we do not need to learn any bridge between the networks in this case.
If we use a higher-dimensional representation, such as the magnitude spectrogram, the features are mismatched and we also need to learn a bridge between the networks, and the results are not as good as in the matched condition. It seems like we cannot take advantage of high-dimensional features here.
We also try the spectrogram, which is otherwise commonly used for enhancement, but it is also worse than the baseline.
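As a sketch of what such a learned bridge could look like, assuming a 257-bin magnitude spectrogram on the enhancement side (the dimensions and names are illustrative), a single linear layer can map the enhancement network's output to the 40-dimensional inputs the frozen networks expect:

import torch.nn as nn

# Hypothetical bridge: project 257-dim spectrogram frames produced by the
# enhancement network down to the 40-dim log-Mel-like inputs that the
# pre-trained auxiliary / x-vector networks were trained on.
bridge = nn.Linear(257, 40)

def bridged_forward(enhanced_frames, frozen_net):
    return frozen_net(bridge(enhanced_frames))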
Now we look at the effect of enhancement on the PLDA and the x-vector extractor training data. The first row is not as good as the setup where only the test data was enhanced. When PLDA is written as a label, it means the PLDA training data is also enhanced, and this deteriorates the result to around seven percent. For the minDCF there is not much change. So we feel that the PLDA is not benefiting; rather, it is quite susceptible to enhancement processing. If we enhance the x-vector training set, there is improvement over the corresponding baseline; however, it is not as good as just enhancing the test data. It seems like the robustness of the whole system is lost, so this is not working, at least for this corpus.
Next, we combine the enhanced features with the original ones to see if we can take advantage of them as complementary information. The notation O+E means the original plus the enhanced version of the data, and O+E PLDA in the column means this combination is also used in the PLDA. Including the original features alongside the enhanced ones in the test data seems to be helping. When we combine these features in the x-vector training set, we actually get much better performance; it seems like the network now sees double the data, and there is also complementary information in the enhanced features, so they can be beneficial. If we include the same features in PLDA training as well, it does not help, which again shows that the PLDA is not suited to enhancement processing. It is best to just put the enhanced features in the x-vector training set for a robust network.
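The bootstrapping itself is just doubling the training list; a sketch with a hypothetical enhance() function:

def original_plus_enhanced(train_utts, enhance):
    # O+E: keep every original utterance and add its enhanced copy,
    # doubling the x-vector training data.
    return train_utts + [enhance(u) for u in train_utts]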
Now we see what happens if we leave one type of noise class out of the x-vector training data or the enhancement data. Let's focus on the last block of this table, which is for the music class. Looking at the last column: if I skip the music files in x-vector training and also don't use enhancement, I actually do better than the baseline, which means removing music is good; this class actually hurts performance. Next, unseen means I use enhancement but the enhancement network has not seen music, and it is still able to improve, which is somewhat surprising. Most interestingly, seen, which is when I use an enhancement network that has seen music, is the best. So it seems like some noise classes are harmful, and it is better not to give them to x-vector training but to include them in the enhancement training data.
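The leave-one-out protocol itself is simple; a sketch with illustrative class names:

MUSAN_CLASSES = ["babble", "music", "noise"]

def augmentation_classes(leave_out=None):
    # Build the noise-class list for augmentation, optionally leaving one
    # class out (e.g. "music") to measure its contribution.
    return [c for c in MUSAN_CLASSES if c != leave_out]

# e.g. x-vector training without music, enhancement training with everything:
xvec_classes = augmentation_classes(leave_out="music")
enh_classes = augmentation_classes()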
Next, to see if we can do dereverberation with the deep feature loss, we tried several schemes: the traditional cascaded scheme, doing dereverberation and denoising in a joint fashion, and also a single-stage variant, which is denoted joint one-stage. If you view all these numbers, you can see that the dereverberation is not actually working. It is possible that we simply did not find the right configuration; nevertheless, it seems that a dedicated pre-processing step may be needed to improve on this.
Finally, the takeaways: you need to use all of the initial layers of the auxiliary network for this type of training. We used WADA-SNR based filtering to keep the highest-SNR utterances from VoxCeleb and construct the clean data for enhancement network training. A feature mismatch between the enhancement network and the auxiliary and x-vector networks is slightly worse, so it is better to use the same features. The PLDA is not really benefiting; it is very susceptible to enhanced data, so unfortunately we cannot use it there. Some noise types, like music, are harmful in the x-vector training data. And finally, dereverberation is not working with this type of training scheme.
That is the end of the presentation. Please feel free to send questions our way. Thank you.