0:01:01 | This work may or may not be well known to you, so you may not know whether you want to pay attention to it; let me just summarize it first, so that those of you interested in the topic know to pay attention. |
0:01:22 | Okay, so let me start. In speech processing, some knowledge about the characteristics present in the acoustic signal can be helpful for improving the performance of recognition systems; in our case, we are interested in speaker recognition systems. |
0:02:04 | We have already been using such information about the speech before, as side information in fusion and calibration, and it worked well. |
0:02:14 | Most of the recent approaches, at least the recent work people do for the NIST evaluations, usually build independent detectors or estimators for the various kinds of information: estimators of signal-to-noise ratio or reverberation, detectors of language, and so on. |
0:02:39 | What we propose in this work is to detect the acoustic condition directly from the i-vectors that everybody uses nowadays. We are going to show that you can detect all kinds of acoustic condition information, nuisance conditions, call it what you like, from the i-vector, and then use this information in a quite simple way in the calibration and fusion of speaker recognition systems. |
0:03:09 | Maybe the most important previous work on making use of acoustic condition detection: you may still remember feature mapping, where features were compensated based on detecting the acoustic condition, or more specifically the channel, of the signal; that worked with channel-specific Gaussian mixture models. |
0:03:43 | Then we thought we didn't need to detect the channel or acoustic condition anymore, because we got the wonderful joint factor analysis and i-vectors, and we assumed the channel compensation scheme would account for this intersession variability directly. But again we saw that using some side information in calibration or fusion was actually helping, even though we were compensating with subspace methods. |
0:04:17 | Again, the side information we have been using was, as I already mentioned, extracted by language identification systems or signal-to-noise ratio estimators; we collected all kinds of information about the signal and tried to make use of it to improve the speaker identification system. |
0:04:46 | There are several different ways of using such information. Probably the simplest is condition-specific fusion or calibration, where you train a different calibration for a specific condition, like a specific duration, or English-only trials versus trials spoken in different languages. |
0:05:04 | Or, and I believe Niko Brümmer actually started with this in his FoCal toolkit, there is bilinear score-side-information combination, where a bilinear form models the interaction between the scores and the side information itself. |
0:05:30 | In those approaches we were collecting information about the trial itself rather than about the individual recordings in the trial; the side information for a trial would be things like: do the two recordings in the trial come from different languages, are both of short duration, are both of long duration. That is what we had tried. |
0:06:00 | In the recent evaluations we finally moved to extracting side information from the individual segments and combining the side information from the individual segments in a certain way to improve, again, calibration and fusion. This is the approach we will elaborate on in this talk, so let me describe it more closely. |
0:06:27 | So what is our approach to detecting these acoustic conditions? As I said, we are going to use i-vectors as the input to a classifier that detects a predefined set of various audio characteristics; in our case it is a simple linear Gaussian classifier, similar to what people use for i-vector based language identification. |
0:06:59 | The way we represent the audio characteristics of the signal is simply the vector of posterior probabilities of these individual classes. We would like to show that we can use this vector of posteriors as side information for the fusion and calibration of a speaker recognition system and get quite substantial improvement in performance. |
0:07:23 | In this work we actually use exactly the same i-vectors for both characterization of the audio segment and speaker recognition. The justification, the reasoning we have, is that the nuisance characteristics captured in the i-vector itself are the ones that affect speaker ID performance; so if we detect those characteristics from the i-vector itself, they should be the ones most important for improving speaker recognition performance, for compensating the effects that are still there in the i-vectors. |
0:08:03 | Before I get into more detail on what exactly we do and into the actual results, let me introduce the development and evaluation sets used in this work; that should give you some idea of what kind of variability and what kinds of conditions we actually address. |
0:08:21 | The data we have used is the PRISM evaluation set, which is something we presented during the last NIST workshop; it is a database that we collected and put together for the BEST project. |
0:08:45 | The data itself comes from Fisher, the SRE evaluations, and Switchboard, so basically the data everybody uses for training systems for the NIST evaluations; but we tried to build an evaluation set that accounts for different kinds of variability. There was a huge investment in normalizing all the metadata of all the files, and we tried to include as many trials as we could. |
0:09:16 | We created trials for specific types of variability, so we define lots of evaluation conditions, each targeting a specific type of variability: different speaking styles, different vocal effort, language variability. Usually a condition isolates a single type of variability; we didn't try to mix the different types. |
0:09:40 | That way we can look at the results and see what degradation each type of variability causes. We also always tried to create many more trials compared to what has been defined for the NIST evaluations. |
0:09:58 | We also tried to introduce new types of variability into the data, specifically noise and reverberation: we artificially added noise and reverberation to the audio, and I will say a few more words about this on the next slide. There are also duration conditions. |
0:10:22 | The PRISM set consists of two parts, a development part and an evaluation part. At the moment the evaluation part has about one thousand speakers, around thirty thousand audio files, and more than seventy million trials; the development part then has sixty thousand speakers coming from about a hundred thousand sessions. |
0:10:55 | Designing this set, just to give you some idea, wasn't really as easy as saying we use Fisher, Switchboard, and SRE04 for training and the rest for testing; we really paid attention to how the data from the different collections were split between training and testing. For example, to get some language variability we wanted to use certain evaluation data for testing, but at the same time we wanted some of the channels and some of the microphones from those collections to be covered in training as well. |
0:11:48 | So we really paid attention to splitting the database, and the result is the large number of trials that you see. |
0:11:58 | Let me just quickly summarize how we designed the noisy and reverberant data. The process was designed so that it defines the way the noise is added; we tried to use open-source tools and freely available noises, so that if other people are interested in adding new noises, it should be straightforward to extend the set with new types of noises and also new reverberations. |
0:12:34 | The blue box pretty much summarizes the additive noise part: we used an open-source tool for adding noise to the data at a specific signal-to-noise ratio. Of course we used different kinds of noises for the training data, for the enrollment segments, and for the test segments, and we tried to make sure that we never train and test with the same noise: not even noise taken from the same file, not even from exactly the same time. |
0:13:16 | So if I say it is cocktail-party noise, the actual noises may be noise from a restaurant or noise from a bar, different kinds of babble noise, and we make sure we never mix very similar noises between training and test. We added the noise to the data at different SNRs, specifically 20, 15, and 8 dB. The noise was added only to clean data, and we verified beforehand that the data were really clean. |
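A minimal sketch of this kind of SNR-controlled mixing (plain numpy; this is illustrative only, not the actual tool used to build the PRISM data, and the function name and whole-signal power measurement are my assumptions):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at a target SNR given in dB."""
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)

    # Choose the scale so that 10*log10(p_speech / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

With such a helper, the three PRISM noise levels would correspond to `snr_db` values of 20, 15, and 8; a real setup would typically measure the speech power over speech frames only.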
0:13:59 | Similarly, we defined a reverberant subset of the data, for which we again used an open-source tool, this one for simulating impulse responses of a rectangular room; we then added reverberation at different reverberation times to the data, and again paid attention never to apply the same reverberation to training and test data. |
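A corresponding sketch for the reverberant subset: convolve the dry signal with a room impulse response produced by a room simulator at the desired reverberation time (the `reverberate` helper and the level matching are assumptions, not the exact PRISM recipe):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir):
    """Convolve dry speech with a simulated room impulse response (RIR)."""
    wet = fftconvolve(speech, rir)[:len(speech)]
    # Roughly match the level of the dry signal.
    wet *= np.sqrt(np.mean(speech ** 2) / (np.mean(wet ** 2) + 1e-12))
    return wet
```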
0:14:33 | Okay, so now that we have the data, a few more details on the audio characterization system. The system itself is based, as I said, on i-vectors, and the i-vector extractor is pretty much the standard one that everybody uses nowadays: a UBM with Gaussians trained on mean- and variance-normalized features. The extractor is actually exactly the same as the one for speaker identification; it is trained only on speech frames, with silence dropped. It is quite possible, though, that for detecting the different conditions, other features, or features without the normalization, would be more suitable. |
0:15:27 | The 600-dimensional i-vectors are extracted from the standard total variability space, which is expected to contain information about both the speaker and the acoustic conditions. In this case we didn't do anything to suppress the speaker information, and we used the i-vectors without channel compensation or length normalization for the condition recognition. |
0:16:01 | As the classifier we use a linear Gaussian classifier trained on these i-vectors; it is trained to classify the conditions that I'm going to show on the next slide, and the final audio characterization presented in this paper is the vector of posterior probabilities of the specified classes. In fact, obtaining this vector is as simple as applying a nonlinear function to the i-vector: just an affine transformation of the i-vector followed by a softmax function, and we take the resulting posteriors. This slide summarizes how the whole system works. |
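A minimal sketch of such a linear Gaussian classifier (Gaussian classes with one shared covariance, which makes the posteriors exactly a softmax of an affine function of the i-vector; the class name and API below are assumptions):

```python
import numpy as np

class LinearGaussianClassifier:
    """Gaussian class-conditional model with a shared covariance matrix."""

    def fit(self, X, y, prior=None):
        classes = np.unique(y)
        means = np.stack([X[y == c].mean(axis=0) for c in classes])
        # Shared within-class covariance (small ridge for stability).
        Xc = np.concatenate([X[y == c] - means[i] for i, c in enumerate(classes)])
        cov = Xc.T @ Xc / len(Xc) + 1e-6 * np.eye(X.shape[1])
        P = np.linalg.inv(cov)
        # Affine form of the class log-likelihoods: W @ x + b.
        self.W = means @ P
        self.b = -0.5 * np.einsum('cd,de,ce->c', means, P, means)
        if prior is not None:
            self.b += np.log(prior)
        return self

    def posteriors(self, X):
        a = X @ self.W.T + self.b            # affine transformation
        a -= a.max(axis=1, keepdims=True)    # numerical stability
        e = np.exp(a)
        return e / e.sum(axis=1, keepdims=True)  # softmax
```

Fitting this on condition-labeled i-vectors and calling `posteriors` would yield the side-information vectors described in the talk.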
0:16:48 | And as you can see, we use the same training data for everything: for training the UBM, for training the subspace matrix, and also for training the classifier. |
0:17:18 | These are the classes our system tries to distinguish: telephone data, microphone data, and noisy data, where the noisy data are those recordings with noise added to the originally clean microphone data; there we distinguish three conditions, noise at 8 dB, 15 dB, and 20 dB SNR. For the reverberant data we define the conditions according to the reverberation time: 0.3 seconds, 0.5 seconds, and so on. In the table you can see how much data is used for training and test. |
0:18:02 | As you can see, there is always the same number of training and test files for the noisy conditions, because those are actually the same files with noise added at the different levels. |
0:18:23 | The way we use those classes, since we just take the vector of posteriors over the classes we defined, assumes that the classes are mutually exclusive. That is exactly the case with our training and evaluation data, because this is how our evaluation set was designed: we never have reverberation and noise in the same recording. |
0:18:47 | But of course this is unrealistic; in reality you can have reverberation on top of background noise. Still, we believe the characterization presented in this paper would be useful even for such conditions, because the vector of posteriors can account for a mix of conditions in the data as well. |
0:19:21 | The intuition is that if you have a recording that comes from, say, a noisy microphone channel, then when we do this estimation we probably get a vector of posteriors that somehow reflects this: some probability mass on the microphone class and some on the noisy classes, roughly according to how much of each condition is present. |
0:19:51 | But of course we could go about it in a more principled way: we could even train independent classifiers for the independent types of variability, classifiers telling us whether there is noise and at which level, whether there is reverberation and at what level, whether it is telephone or microphone; these would be trained on data which contain a mix of such conditions. |
0:20:14 | This table summarizes the performance we obtained in detecting these conditions; it shows the true classes against the detected classes. (And I see now that I'm supposed to be pressing enter... space.) If we had perfect classification we would see one hundred on the diagonal and zeros elsewhere, since this is a confusion matrix normalized so that each row sums to one hundred percent. |
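For concreteness, the row normalization described here is simply (a trivial numpy sketch, names assumed):

```python
import numpy as np

def row_normalized_confusion(counts):
    """Normalize a confusion matrix of raw counts so each true-class row
    sums to 100; perfect detection puts 100 on the diagonal."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)
```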
0:21:00 | You can see that this is not quite what happened. What we were pleased with, though, was that the recognition of microphone versus telephone data is almost perfect. For the clean microphone class we get some confusion with the 20 dB noisy data; as I told you, we created the noisy data precisely by taking these microphone data and adding noise, including at exactly this 20 dB SNR. If you listen to the recordings, some of the nominally clean data actually contain some noise, so it's quite natural that some files from the clean microphone class get classified as 20 dB, which is a neighboring condition. |
0:21:51 | Also, if you look at the different noise levels, we see quite reasonable performance: reasonably large numbers on the diagonal, with 20 dB again nicely recognized. There is some confusion, but that is to be expected, especially between signal-to-noise ratios which are close to each other. |
0:22:16 | One thing we actually saw was that most of the confusion comes from types of noise that don't really affect the i-vector much; such noises resulted in almost exactly the same i-vector as the clean data. What was also interesting is where we don't do very well: the conditions where we try to detect the reverberation time. There the detections come out all over the place, really confused, even with the noisy data. |
0:22:58 | The main reason we believe this is happening is that defining the conditions by reverberation time is not actually a good thing to do. If you play the reverberated recordings, you can hear that one type of reverberation at one reverberation time can be perceptually much more similar to another recording with a completely different reverberation time, so reverberation time is probably not the right criterion. Still, as we will see, using these data actually improves the speaker recognition performance, so it looks like the classification itself does a reasonable job of putting things into classes that are useful, even if they are not exactly the classes we defined. |
0:23:46 | So finally, how do we use this information about the acoustic condition in calibration, to improve the speaker recognition system? We use the approach that Niko Brümmer proposed when we built the ABC system for a NIST SRE, and I believe it is what is implemented in the freely available BOSARIS toolkit. |
0:24:16 | The idea is to train just one calibration. In standard calibration people train a linear transformation of the score: some multiplier on the score plus some bias. Here we add a bias term which is a bilinear combination between the vector of posteriors from the first segment and the vector of posteriors from the second segment of the trial, with a trained matrix in between. This bilinear form gives a trial-dependent bias, which is added to the scaled score to produce the final calibrated score. |
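Written out, this calibration would look something like the following sketch, where $s$ is the raw score, $\mathbf{q}_1$ and $\mathbf{q}_2$ are the posterior vectors of the two trial segments, and $\alpha$, $\beta$, $\mathbf{B}$ are the trained calibration parameters (the notation is mine, not from the talk):

```latex
\ell' = \alpha\, s + \beta + \mathbf{q}_1^{\top} \mathbf{B}\, \mathbf{q}_2
```

The bilinear term $\mathbf{q}_1^{\top}\mathbf{B}\,\mathbf{q}_2$ is the trial-dependent bias; for fusion, $\alpha s$ would become a weighted sum of the subsystem scores.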
0:25:11 | The side information we use here is exactly the same vectors of posteriors mentioned before. I'm running out of time, so let me just say briefly that we are presenting results on a list of conditions that is a subset of all the PRISM conditions, summarized here. |
0:25:38 | The conditions cover telephone and microphone data, different vocal effort, different languages in the recordings, different noises in the recordings, and room reverberation. The system we use for speaker ID uses exactly the same i-vectors, as I said, with length normalization and LDA, and a PLDA model trained on the PRISM training set. |
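A rough sketch of that backend preprocessing, length normalization followed by LDA (sklearn and the toy dimensions are assumptions; the PLDA stage is omitted):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def length_normalize(X):
    """Project i-vectors onto the unit sphere (standard length normalization)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Toy illustration: 600-dim "i-vectors" for 10 speakers; sklearn caps
# n_components at n_classes - 1, and a PLDA model would follow.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 600))
y = np.repeat(np.arange(10), 10)
X_lda = LinearDiscriminantAnalysis(n_components=9).fit_transform(
    length_normalize(X), y)
```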
0:26:10 | This slide shows the results, and you can see that we actually get nice improvements; these are the DCF and EER for the individual evaluation conditions. |
0:26:23 | Maybe most relevant are these conditions, which are conversations recorded over a microphone; there our approach does a very good job. Perhaps surprisingly, we also get some improvements that come from the individual conditions where the system does a good job of detecting the noise condition and distinguishing the different noise levels. |
0:27:02 | It also does a reasonable job on the room reverberation conditions. The only conditions where we don't get any improvement at all are language and vocal effort, and that comes simply from the fact that we didn't have such conditions among the classes the classifier was trained to detect, so there is nothing to tell the calibration how to improve. |
0:27:35 | The next slide shows pretty much the same gains for fusion, when the side information is used in fusing two systems, a cepstral and a prosodic one; so the same story holds for fusion. The conclusions summarize what I have presented. |
0:28:14 | So, what happens is that it doesn't classify according to the classes as we defined them in training and test. Take reverberation: we defined the classes just by reverberation time, but the recordings come from different simulated rooms. What we have seen is that if you listen to the recordings in the test, you find two recordings that sound perceptually similar but come from different classes, so the way we defined the classes probably wasn't right. |
0:28:53 | The problematic part is how the classes are defined; it's probably not correct, and a more natural clustering would account for the type of reverberation: I mean early reflections, which hardly harm the speech, as opposed to late reverberation that is spread over time, which really affects the speech. |
0:29:16 | Recordings that sound similar would then be considered to come from the same class, and you would probably get classes which are better related to speaker recognition performance. In the end that helps the speaker recognition performance, even though the classification looks poor with respect to the classes as we defined them. |