0:00:16 Hello everyone, we are from Johns Hopkins University. Our work is on speaker verification and speech enhancement. Let's start with the slides.
0:00:36 This presentation is an analysis of our system, which does enhancement for speaker verification. I will be reusing some slides from my previous work, which was called feature enhancement with deep feature losses for speaker verification. The downstream task is speaker verification.
0:00:59 The problem refers to the task of determining whether the speaker in utterance one, which is the enrollment utterance, is the same as the speaker in utterance two, which is the test utterance.
0:01:13 The state-of-the-art way to implement this is to use a so-called x-vector network and probabilistic linear discriminant analysis (PLDA), along with data augmentation.
0:01:30 Speech enhancement is used in this problem to help speaker verification by pre-processing the enrollment and test utterances at test time. It has been noted that speech enhancement may only help when it is trained in the context of the speaker recognition objective.
0:01:52 So we pursue what is called deep feature loss training, which connects the two problems, as we will see now.
0:02:02 This is the schematic of deep feature loss training. As you can see, there are two networks: one is the enhancement network, and the other is the auxiliary network. The enhancement network takes noisy features and produces enhanced features. These enhanced features are not directly compared with the clean features; instead, both are forwarded through the auxiliary network, and the differences in the intermediate activations are what we call the deep feature loss.
0:02:43 When we don't use the auxiliary network and simply compare the enhanced features with the clean features, that is called the feature loss.
0:02:54 As you can imagine, this type of training is doing enhancement; however, task-related information is also used, since it is implicitly encoded in the auxiliary network.
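Purely as an illustration of the idea above, here is a minimal PyTorch sketch of deep feature loss training; the toy network sizes, the L1 distance, and the equal weighting of layers are assumptions for the example, not the exact setup used in the talk.

```python
import torch
import torch.nn as nn

class Enhancer(nn.Module):
    """Toy enhancement network: noisy features in, enhanced features out."""
    def __init__(self, feat_dim=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=5, padding=2),
        )

    def forward(self, x):                      # x: (batch, feat_dim, time)
        return self.net(x)

class Auxiliary(nn.Module):
    """Toy pre-trained auxiliary network that exposes intermediate activations."""
    def __init__(self, feat_dim=40, n_layers=5):
        super().__init__()
        dims = [feat_dim] + [256] * n_layers
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Conv1d(dims[i], dims[i + 1], kernel_size=3, padding=1),
                           nn.ReLU())
             for i in range(n_layers)]
        )

    def forward(self, x):
        acts = []
        for layer in self.layers:
            x = layer(x)
            acts.append(x)
        return acts                            # one activation tensor per layer

def deep_feature_loss(aux, enhanced, clean):
    """Sum of L1 distances between auxiliary activations of enhanced and clean
    features; the auxiliary network itself stays frozen."""
    with torch.no_grad():
        clean_acts = aux(clean)
    enh_acts = aux(enhanced)
    return sum(nn.functional.l1_loss(e, c) for e, c in zip(enh_acts, clean_acts))

# One illustrative training step on random placeholder features.
enhancer, aux = Enhancer(), Auxiliary()
for p in aux.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
noisy, clean = torch.randn(8, 40, 200), torch.randn(8, 40, 200)
loss = deep_feature_loss(aux, enhancer(noisy), clean)
opt.zero_grad(); loss.backward(); opt.step()
```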
0:03:08 This is how our speaker verification pipeline looks: the enrollment and test utterances go through feature extraction independently and are also enhanced independently. Then each of them goes through an embedding extractor, which in our case is the x-vector network, and a PLDA classifier computes a log-likelihood ratio to decide whether they belong to the same speaker or not.
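To make the scoring step concrete, here is a minimal sketch of that pipeline with a simple two-covariance PLDA log-likelihood ratio; the `enhance` and `embed` functions and the covariances B and W are placeholders, not the actual enhancement, x-vector, or PLDA models from the talk.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, B, W):
    """log p(x1, x2 | same speaker) - log p(x1, x2 | different speakers) for a
    two-covariance PLDA with between-class covariance B and within-class W."""
    d = len(x1)
    joint = np.concatenate([x1, x2])
    cov_same = np.block([[B + W, B], [B, B + W]])
    cov_diff = np.block([[B + W, np.zeros((d, d))], [np.zeros((d, d)), B + W]])
    zero = np.zeros(2 * d)
    return (multivariate_normal.logpdf(joint, mean=zero, cov=cov_same)
            - multivariate_normal.logpdf(joint, mean=zero, cov=cov_diff))

# Stand-ins for the enhancement network and the x-vector extractor.
def enhance(feats):
    return feats                               # placeholder enhancement

def embed(feats):
    return feats.mean(axis=1)                  # placeholder embedding extractor

d = 8
B, W = 0.5 * np.eye(d), np.eye(d)              # toy PLDA covariances
enroll, test = np.random.randn(d, 100), np.random.randn(d, 100)
score = plda_llr(embed(enhance(enroll)), embed(enhance(test)), B, W)
print("log-likelihood ratio:", score)          # above a threshold => same speaker
```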
0:03:44 Now, these are the details of how the training database is constructed. We use a noise corpus which consists of three noise classes, including general noise and music. These noise classes are combined with VoxCeleb2, which contains 16 kHz conversational speech, and we call this combined set VoxCeleb2-combined; the noises are added at randomly sampled SNRs, with each utterance augmented at a fifty percent rate.
0:04:33 We also use an SNR-based filtering algorithm called WADA-SNR to create a fifty percent subset of VoxCeleb2 alone, which is supposed to preserve the highest-SNR utterances from the corpus. This clean version of VoxCeleb2 is then combined with the same noises, and that serves as the noisy counterpart for our supervised enhancement training.
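A minimal sketch of how such a noisy/clean training pair could be constructed; the SNR range and the looping of short noise clips are assumptions for the example.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR, then add it to `clean`."""
    if len(noise) < len(clean):                          # loop the noise if too short
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                       # 1 s of 16 kHz "speech"
noise = rng.standard_normal(8000)                        # shorter noise clip
snr_db = rng.uniform(0, 20)                              # assumed SNR range
noisy = mix_at_snr(clean, noise, snr_db)                 # (noisy, clean) training pair
```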
0:05:04 The PLDA is trained with the VoxCeleb2-combined data, which the other networks also use.
0:05:14 To give a few more details: the features we use are forty-dimensional log mel filterbank features, and this is the same across the networks. The evaluation is done on BabyTrain, which is a corpus containing speech of young children recorded in uncontrolled environments. The complete data is about two hundred fifty hours and is divided into detection and diarization tasks; we have not employed the diarization component in our pipeline. For the evaluation data, the numbers of speakers in enroll and test are five hundred ninety-five and one hundred fifty respectively, and results are presented in the form of equal error rate and minimum decision cost function with a target prior probability of five percent.
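For reference, a small sketch of how the two reported metrics could be computed from trial scores, assuming unit miss and false-alarm costs; the toy score distributions are placeholders.

```python
import numpy as np

def eer_and_min_dcf(target_scores, nontarget_scores, p_target=0.05):
    """Equal error rate and normalized minimum DCF (C_miss = C_fa = 1 assumed)."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    labels = labels[np.argsort(scores)]                  # sweep thresholds over sorted scores
    n_tar, n_non = labels.sum(), (1 - labels).sum()
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_tar])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / n_non])
    i = np.argmin(np.abs(p_miss - p_fa))
    eer = 0.5 * (p_miss[i] + p_fa[i])
    dcf = p_target * p_miss + (1 - p_target) * p_fa
    return eer, dcf.min() / min(p_target, 1 - p_target)  # normalize by the trivial system

rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 1000)                         # toy target trial scores
non = rng.normal(0.0, 1.0, 10000)                        # toy non-target trial scores
eer, min_dcf = eer_and_min_dcf(tar, non)
print(f"EER = {eer:.2%}, minDCF(p_target=0.05) = {min_dcf:.3f}")
```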
0:06:05 The table that you see here is from our previous work, which we want to analyze further in this work. If you focus on the second dataset column, which is for BabyTrain, you can see that the first row is actually without any enhancement and refers to the original version of our x-vector network; the label in the first column is just a notation to denote the type of enhancement and training data used. So this row gives the result on unenhanced data, which is seven point six percent EER. Then we use the feature loss, the deep feature loss, and also their combination, and the deep feature loss usually gave the best performance previously. The final row is the comparison of how much performance we gain, and you can see that with just the deep feature loss we get the most improvement over the unenhanced system.
0:07:11 Having said that, we want to address the following questions. First: are only the initial layers of the auxiliary network useful for deep feature loss training, and is the feature loss additive with the deep feature loss? Second: for supervised enhancement training, how clean does the training data need to be, and can we just use the speech from our augmented database despite mismatch issues? Third: the x-vector and auxiliary networks are pre-trained on low-dimensional features; can we train the enhancement network on higher-dimensional features and still get some benefit? Fourth: if we enhance the PLDA and x-vector training data, does that bring improvements? Fifth: can we use enhanced features to bootstrap the training data, doubling the amount of data and making our x-vector network more robust? Sixth: are the noise classes we are working with really useful during the data augmentation process, or are some noise classes even harmful? And finally: is the proposed scheme useful for the task of dereverberation and for joint denoising and dereverberation?
0:08:40 First, we reproduce the baseline and see which layers are good for the deep feature loss extraction.
0:08:48 This is a results table with a lot of numbers, but for this presentation it is enough to focus on the first column, which gives the labels for the type of loss or data that is used, and the final column, which is the mean result on the BabyTrain test set.
0:09:11 The first row shows the result without enhancement. Then we have the deep feature loss extracted from five layers: our auxiliary network has six layers, the first five are used in this loss, and the sixth is the classification layer, which we are not using. This gives the best performance among the combinations we tried.
0:09:41 The plain feature loss gives worse performance than even the baseline, which reproduces the observations from our previous work, and combining the two losses is also not good.
0:09:57 When you add the embedding layer, that is, the last layer of the auxiliary network, to the deep feature loss, it is also not helpful.
0:10:07 Then we use the deep feature loss with fewer layers: four, three, two, and finally one layer, and they are not as good as using all five layers.
0:10:18 The bottom half of the table is for the minimum decision cost function, and the observations are mostly the same as with the equal error rate.
0:10:27 So here we have seen that the plain feature loss is inferior for our system, and combining it with the deep feature loss is also not useful. Using more layers gives the best results at the price of increased computational complexity, which is acceptable. The main takeaway is that you need to use all of the initial layers of the auxiliary network.
0:10:58 Next we look at the choice of training data for the enhancement and auxiliary networks. We see that the row VoxCeleb2-filtered, which means VoxCeleb2 alone with the WADA-SNR filtering, used for the enhancement network and, as a consequence, for the auxiliary network as well, gives the best performance, shown in boldface.
0:11:26 Using VC2, which is VoxCeleb2 without filtering, or VC2-combined, which is VoxCeleb2 combined with the noise augmentations, is not as good.
0:11:37 We also see what happens if we do a stricter filtering for the enhancement network, which means keeping an even cleaner, smaller subset of VoxCeleb2, and it is not as good as the fifty percent WADA-SNR filtering. This shows that further screening of VoxCeleb2 on SNR seems to be unimportant.
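A small sketch of the SNR-based filtering idea; `estimate_snr_db` here is only a crude energy-based stand-in for the actual WADA-SNR estimator used in the talk.

```python
import numpy as np

def estimate_snr_db(wav, frame=400):
    """Crude energy-ratio SNR proxy (NOT the real WADA-SNR algorithm)."""
    frames = wav[: len(wav) // frame * frame].reshape(-1, frame)
    energy = np.sort(frames.var(axis=1))
    n = max(1, len(energy) // 10)
    noise_floor = energy[:n].mean() + 1e-12              # quietest frames ~ noise
    speech_level = energy[-n:].mean()                    # loudest frames ~ speech
    return 10 * np.log10(speech_level / noise_floor)

def keep_highest_snr(wavs, fraction=0.5):
    """Keep the `fraction` of utterances with the highest estimated SNR."""
    order = np.argsort([estimate_snr_db(w) for w in wavs])[::-1]
    return [wavs[i] for i in order[: int(len(wavs) * fraction)]]

rng = np.random.default_rng(0)
wavs = [rng.standard_normal(16000) * g for g in (0.1, 1.0, 2.0, 5.0)]  # toy "utterances"
clean_subset = keep_highest_snr(wavs, fraction=0.5)      # the retained "clean" pool
```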
0:12:03 And if we use LibriSpeech, you can see it performs even worse than the unenhanced baseline. So clean read speech like LibriSpeech, being non-conversational and mismatched data, is bad for training, even when it is used only as the clean counterpart for the enhancement network.
0:12:28 We also see that the auxiliary network is most powerful when it is trained on the combined, augmented data, which suggests that the more data is used and the better the data augmentation, the better the auxiliary network.
0:12:46 Next we see what happens if we mismatch the features of the enhancement network: can we use higher-dimensional features in the enhancement network? The second row uses forty-dimensional log mel filterbank features for the enhancement network. Recall that forty-dimensional features are used in the auxiliary and x-vector networks as well, so this is the condition where the features are matched and we do not need to learn any bridge between the networks, and it performs best.
0:13:22 If we use higher-dimensional features, such as a higher-dimensional log mel or a magnitude spectrogram, then the features are mismatched and we need to learn a bridge between the networks as well, and the results are not as good as in the matched condition. It seems we cannot take advantage of high-dimensional features here.
0:13:43 We also tried the spectrogram, since it is commonly used in the enhancement literature, but it is also worse than the baseline.
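To illustrate the matched versus mismatched front-ends, here is a small torchaudio sketch producing forty-dimensional log mel filterbanks and a higher-dimensional magnitude spectrogram from the same waveform; the FFT size and hop length are illustrative assumptions.

```python
import torch
import torchaudio

wav = torch.randn(1, 16000)                              # 1 s of 16 kHz audio (placeholder)

# Matched front-end: 40-dim log mel filterbanks, as used by the x-vector network.
logmel40 = torch.log(
    torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=512, hop_length=160, n_mels=40)(wav) + 1e-6)

# Mismatched front-end: higher-dimensional magnitude spectrogram.
spectrogram = torchaudio.transforms.Spectrogram(
    n_fft=512, hop_length=160, power=1.0)(wav)

print(logmel40.shape, spectrogram.shape)                 # (1, 40, T) vs (1, 257, T)
```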
0:13:58 Next we see the effect of enhancing the PLDA and x-vector training data. The first row is not as good as the setup where only the test data is enhanced. When the PLDA appears in the label together with the test data, it means the PLDA training data is also enhanced, and this deteriorates the error rate. For the minDCF there is not much change, so we feel that the PLDA is not really benefiting; rather, it is susceptible to the enhancement processing.
0:14:49 If we enhance the x-vector training set, there is an improvement over the unenhanced baseline; however, it is not as good as just enhancing the test data. And when we enhance both of them, it seems the robustness of the whole system is lost, so this is not working, at least for this corpus.
0:15:16 Next we combine the enhanced features with the original ones to see if we can take advantage of them and make them complementary to the original features. Here the enhanced label means the enhanced version of all the data is used, and where the PLDA appears in the column it means PLDA adaptation is then performed. Including the enhanced features alongside the original features only in the PLDA seems to be degrading our performance.
0:16:01 When we combine these features in the x-vector training set, it actually gives much better performance; it seems the network benefits from the doubled data, and there is also complementary information in the enhanced features, so they can be combined. If we use these features both in training and in the PLDA, it does not help.
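A minimal sketch of this bootstrapping idea: pool the original and enhanced versions of every training utterance under the same speaker label; `extract` and `enhance` below are stand-ins for the real feature extractor and enhancement network.

```python
import numpy as np

def build_training_set(utterances, labels, enhancer, extract_features):
    """Pool original and enhanced features of every utterance, reusing its label."""
    feats, speakers = [], []
    for wav, spk in zip(utterances, labels):
        x = extract_features(wav)
        feats.append(x); speakers.append(spk)            # original features
        feats.append(enhancer(x)); speakers.append(spk)  # enhanced copy, same speaker
    return feats, speakers                               # twice as many examples

# Toy usage with stand-in components.
extract = lambda wav: np.abs(np.fft.rfft(wav.reshape(-1, 400), axis=1))  # crude frames
enhance = lambda feat: feat                                              # stand-in enhancer
rng = np.random.default_rng(0)
wavs = [rng.standard_normal(16000) for _ in range(4)]
feats, spks = build_training_set(wavs, ["spk1", "spk1", "spk2", "spk2"], enhance, extract)
assert len(feats) == 2 * len(wavs)
```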
0:16:32 So this shows that the PLDA is not suitable for enhancement processing; it is better to just put the enhanced features in the x-vector training set and leave the PLDA alone.
0:16:46 Now we see what happens if we remove one type of noise class from the x-vector network or the enhancement data. Let's focus on the top rows of this table, which are for the noise class music. Looking at the last column, if we skip using the music files in the x-vector network training and also don't use enhancement, we are actually doing better than the baseline, which means removing music is good; this noise class actually hurts performance.
0:17:29 Next, unseen means we use enhancement but the enhancement network has not seen music, and it is still able to improve on this number somewhat. Most interestingly, seen, which is when we use an enhancement network that has seen music, gives the best result. So it seems like some noise classes are better left out of the x-vector training, but it is okay to include them in the enhancement training data.
0:18:11 Next, to see if we can do dereverberation with the deep feature loss, we tried several schemes: a cascaded scheme with a separate dereverberation stage, and schemes doing denoising and dereverberation in a joint fashion, including a single-stage version denoted joint one-stage. If we look at all these numbers, the dereverberation is not actually working. We suspect it is possible that we have not found a suitable configuration; nevertheless, it seems reverberation needs a better pre-processing step, and improving on this may be straightforward.
0:19:02 Finally, to conclude: you need to use all of the auxiliary network layers, except the final one, for this type of training. We used WADA-SNR-based filtering to keep the highest-SNR utterances and construct the clean data for enhancement network training. Mismatching the features between the enhancement and auxiliary networks is slightly worse; it is better to use the same features. We see that the PLDA is not really robust; it is very susceptible to enhanced data, so unfortunately we cannot put enhanced data into the PLDA. Some noise types, like music, are harmful in the x-vector training data. And finally, dereverberation is not working with this type of training scheme.
0:19:54 So that is the end of the presentation. Please feel free to send questions my way. Thank you.