0:00:01 | hello everybody |
---|
0:00:02 | i welcome you to my tutorial on spoofing in automatic speaker recognition |
---|
0:00:08 | i'm a similarity score on assistant professor at your local news data |
---|
0:00:12 | frames you're looks cool |
---|
0:00:14 | and there's some other regions |
---|
0:00:18 | at a low overall difference was moving detection rate of cognition we first you all |
---|
0:00:26 | speaker verification |
---|
0:00:28 | giving more attention to current research plan and progress |
---|
0:00:32 | in the middle and all this information for a speech systems |
---|
0:00:37 | but also we don't to the cost |
---|
0:00:43 | automatic speaker verification is one of the most convenient and natural means of |
---|
0:00:48 | biometric person recognition |
---|
0:00:51 | this is why this technology is used in numerous applications and services such as smartphones |
---|
0:00:56 | smart speakers and call centres |
---|
0:01:00 | this technology has advanced a lot over the last years thanks to the |
---|
0:01:06 | increasing use of deep neural network solutions |
---|
0:01:09 | such as x-vectors |
---|
0:01:11 | which to some extent perform better than traditional gaussian mixture models |
---|
0:01:16 | or the so-called i-vectors |
---|
0:01:18 | and when the roaches are also emerging |
---|
0:01:22 | we can say that speaker recognition technology has probably reached the level of performance required |
---|
0:01:29 | for practical use |
---|
0:01:33 | it wasn't no is whether or not the remaining system is one a normal to |
---|
0:01:37 | what we're gonna be the answer is yes |
---|
0:01:41 | the reality of voice biometric technology can be compromised by political status namely born and |
---|
0:01:48 | ability to the technology external |
---|
0:01:51 | one of the major threats to the security of biometric systems is spoofing attacks |
---|
0:01:57 | there is there are four |
---|
0:01:59 | the final severe okay stores carry out of whatever you matrix system into recognising and |
---|
0:02:04 | legitimate user is a general user order to avoid being recognised |
---|
0:02:10 | this is achieved by presenting to this is a synthetic for all the money we |
---|
0:02:16 | bash |
---|
0:02:18 | or the volume at least eight |
---|
0:02:20 | but before we locate is a are the second walk ons this system is processed |
---|
0:02:29 | there is this is then try to answer this question is that there's on what |
---|
0:02:34 | they say the are |
---|
0:02:37 | this means that the target that idea in this case studies as well as a |
---|
0:02:41 | non-target trial the t v |
---|
0:02:44 | can be a set the origin by speaker verification system |
---|
0:02:50 | this results in two different types of errors namely false alarms and false rejections |
---|
0:02:56 | as shown in table |
---|
0:02:59 | only if this user used a and a change dataset or that this user is |
---|
0:03:06 | an bolster the challenge i |
---|
0:03:09 | there is a v |
---|
0:03:10 | system based |
---|
0:03:12 | according to their change |
---|
0:03:14 | here target speaks when they are now available is whining boxers makes no f or |
---|
0:03:20 | when there's anything about |
---|
0:03:25 | so |
---|
0:03:26 | given a test right it is we provide some score behind the score integrator the |
---|
0:03:32 | confidence that the speaker voices |
---|
0:03:36 | a better discrimination you see green order to increase in body then between target trials |
---|
0:03:41 | and non-target trial scores by selecting a threshold between the leash motion looks coarse |
---|
0:03:47 | however as shown in the figure the target and non-target score distributions |
---|
0:03:53 | usually overlap |
---|
0:03:55 | this can be seen in the detection error tradeoff curve |
---|
0:04:00 | on the right where the point where the false alarm rate is equal to |
---|
0:04:05 | the false |
---|
0:04:06 | rejection rate is called the equal error rate |
---|
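The operating point described here, where the false alarm rate meets the false rejection rate, can be sketched as follows. This is a minimal illustration, assuming that higher scores mean "more likely the target speaker" and that a trial is accepted when its score is at or above the threshold; the function names are hypothetical and not taken from the talk.

```python
import numpy as np

def error_rates(target_scores, nontarget_scores):
    """Sweep every observed score as a threshold and return both error rates."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # False rejection: a target trial scores below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm: a non-target trial scores at or above the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return thresholds, frr, far

def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) at the point where the two rates are closest."""
    thresholds, frr, far = error_rates(target_scores, nontarget_scores)
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0, thresholds[idx]
```

With overlapping score distributions the two rates cannot both be zero, which is why a single number such as the equal error rate is a convenient summary of the tradeoff curve.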
0:04:09 | is this really realistic |
---|
0:04:11 | though the impostor may can you have for performing system |
---|
0:04:15 | or they can implement it is if you is my task |
---|
0:04:20 | so the aim of the attacker is to provoke false alarms by increasing asv classifier |
---|
0:04:26 | scores towards target scores while avoiding detection |
---|
0:04:30 | we can distinguish spoofing attack impostors from naive impostors |
---|
0:04:34 | the latter are also called zero-effort impostors |
---|
0:04:41 | the processing to create fake speech signal you know it down for let's see that |
---|
0:04:47 | the challenge here is to find a solution to that there are many valuable and |
---|
0:04:52 | involving this process and there are still menu question to ask |
---|
0:04:58 | do their car from linear earlier processing due only receive you know part of the |
---|
0:05:04 | spectrum should be able to look also and the phase signal |
---|
0:05:09 | but something this question later when we have more element goods you are |
---|
0:05:16 | there are many a general approaches for the measures improving the easily robustness for example |
---|
0:05:23 | by speech or the u r c d this is an invasion that action |
---|
0:05:28 | or winded executive countermeasures for example based on that for sure |
---|
0:05:33 | this is and its energy detection |
---|
0:05:37 | in this legal issue on an example that plot you stating baseline performance is when |
---|
0:05:43 | they posters are non-zero for impostors |
---|
0:05:47 | baseline black line |
---|
0:05:49 | the performance degradation when data getting both |
---|
0:05:53 | by the system |
---|
0:05:55 | so is this also that the red line |
---|
0:05:57 | and improvement of the performance is where they can to measure the client |
---|
0:06:03 | this is the one dimensional fashion |
---|
0:06:05 | rule i |
---|
0:06:06 | and know that on a meeting with perfect countermeasures those this is the best performance |
---|
0:06:12 | reach its baseline performance |
---|
0:06:18 | nobody six including voice volume it is becoming an instance |
---|
0:06:22 | many speaker pointed out there is usually issues |
---|
0:06:26 | can think speech |
---|
0:06:29 | this can undermine confidence in asv and it is important to have some level of |
---|
0:06:36 | countermeasures or presentation attack detection to reduce false acceptances |
---|
0:06:42 | due to spoofing attacks |
---|
0:06:46 | does that this additional tasks can be originated from more efficient synthesis |
---|
0:06:51 | or voice |
---|
0:06:52 | in unlogical system old or just we recording related approach you know basic process |
---|
0:06:58 | well |
---|
0:07:00 | where we enjoy directly the audio stream in the easy my |
---|
0:07:06 | these four percent the measured rates |
---|
0:07:09 | and a time or a is impersonation which ones used in dating a human voice |
---|
0:07:17 | also the tree to but this condition is not only inter school and twenty minutes |
---|
0:07:23 | studies |
---|
0:07:24 | involving small datasets |
---|
0:07:26 | it is not surprising a |
---|
0:07:28 | that |
---|
0:07:29 | there is no previous work misleading countermeasures maybe impersonation |
---|
0:07:37 | a possible location of that point the in time typical icily system maybe before or |
---|
0:07:45 | after the microphone as illustrated in three |
---|
0:07:48 | corresponding to physical access and logical |
---|
0:07:53 | is he is more or something then older biometric system based on different biometric is |
---|
0:07:59 | just conceded that symbols of a human persons goal is can be collected the really |
---|
0:08:04 | bystanders to face to face or telephone conversation |
---|
0:08:09 | and then blame in order to my twenty a day is just |
---|
0:08:14 | or more advanced voice conversion or speech synthesis algorithms |
---|
0:08:19 | in used to generate particular |
---|
0:08:22 | if it is looking at that |
---|
0:08:24 | using only modest amounts of voiced the calculate the for a person |
---|
0:08:32 | this table summarize the for splitting and that's in terms of us a single decreases |
---|
0:08:37 | and in we will consider measures |
---|
0:08:40 | except for the impersonation at time so that have a menu model i s is |
---|
0:08:45 | unity |
---|
0:08:47 | and i freeze |
---|
0:08:48 | especially for text event is the scenario and the error of intermediate of dimension |
---|
0:08:55 | that's the use of for scroll |
---|
0:08:58 | generalization it is the meeting to the different |
---|
0:09:02 | or unseen i |
---|
0:09:07 | so this is the timeline which the task |
---|
0:09:10 | two days visible units you |
---|
0:09:13 | and is studies on speaker and feasible thing where and are on me now speech |
---|
0:09:19 | for were created using a limited number or something |
---|
0:09:23 | in see it is clear that the development of can to measure using only a |
---|
0:09:29 | small number was looking at task |
---|
0:09:31 | no you generalization to be |
---|
0:09:35 | moreover |
---|
0:09:36 | there was a lack of a galaxy we will corpora and evaluation bottle but not |
---|
0:09:42 | for the to the results of being by different researchers |
---|
0:09:48 | daisy of this study aims to establish a key during the initial you by making |
---|
0:09:56 | available standard speech corpora |
---|
0:09:58 | with a large amount of spoofing attacks |
---|
0:10:01 | evaluation protocols and metrics |
---|
0:10:04 | to support common evaluation and the benchmarking of different systems |
---|
0:10:10 | the asvspoof challenge has been organised three times so far |
---|
0:10:16 | the first was held in two thousand fifteen |
---|
0:10:18 | the second in two thousand seventeen and the third in two thousand nineteen |
---|
0:10:23 | they were presented at the corresponding special sessions during the interspeech conference |
---|
0:10:32 | is actually current own analyses of this visible for you as well as the their |
---|
0:10:39 | finish definition to partition your see the company around the work |
---|
0:10:47 | the first asvspoof challenge involved detection of artificial speech |
---|
0:10:51 | created using a mixture of voice conversion and speech synthesis techniques |
---|
0:10:57 | it was or something during basically to a special session it english speech of those |
---|
0:11:02 | in |
---|
0:11:03 | and the sixteen organisation have debated the this challenge |
---|
0:11:08 | asvspoof two thousand fifteen involved only logical access attacks and |
---|
0:11:16 | the data was generated with ten different artificial speech generation algorithms |
---|
0:11:23 | well based on a large collections accordingly scolding this of course |
---|
0:11:29 | version well |
---|
0:11:31 | and consist of but not without and t v show that a speech |
---|
0:11:37 | one of each was recorded using i one thing microphone |
---|
0:11:41 | and we don't seem difficult channel or of background noise effects |
---|
0:11:48 | the database was divided into three subsets called |
---|
0:11:53 | the training development and evaluation sets in a speaker disjoint manner |
---|
0:11:58 | five attacks from s one to s five denoted as known attacks |
---|
0:12:05 | were used |
---|
0:12:07 | in the training development and evaluation sets |
---|
0:12:11 | and another five attacks from s six to s ten denoted as |
---|
0:12:17 | unknown or unseen attacks |
---|
0:12:20 | were used only in the evaluation set along with the known attacks |
---|
0:12:27 | based on the dimension and of the bias the or on what it used for |
---|
0:12:33 | voice conditions speech synthesis |
---|
0:12:36 | nine of them are we'll database and the hmm of gmm based addition model |
---|
0:12:43 | while only one s ten is unit selection based |
---|
0:12:46 | speech synthesis implemented with the open source mary |
---|
0:12:50 | text-to-speech system |
---|
0:12:56 | the banana but all of easy system based the on the i-vector but the is |
---|
0:13:02 | pretty clear |
---|
0:13:05 | except for the i guess who |
---|
0:13:08 | well that that's are very effective with importantly reasoning |
---|
0:13:13 | greece all equal error rate |
---|
0:13:16 | in the worst case |
---|
0:13:17 | that is s then |
---|
0:13:20 | i don't to one |
---|
0:13:21 | directly to fifty one will ones |
---|
0:13:24 | it is seventeen |
---|
0:13:28 | so that it will the on the left show here the challenge results |
---|
0:13:33 | the in terms of the average equal error rate across all their a score the |
---|
0:13:39 | evaluation set |
---|
0:13:40 | for no one and i do not |
---|
0:13:44 | the exactly a lack of a generalization these results |
---|
0:13:48 | over the table on the left to sure that |
---|
0:13:55 | i'm sorry believable the double on the on the right initials the that the top |
---|
0:14:00 | performing system evaluated only |
---|
0:14:04 | on the s ten |
---|
0:14:07 | the unit selection based speech synthesis |
---|
0:14:11 | isn't that isn't most if you without |
---|
0:14:13 | then the and the most dangerous for speaker verification system is i are shown previously |
---|
0:14:20 | so as then i used to efficiently the biggest three for the msd system in |
---|
0:14:27 | this case |
---|
0:14:31 | and used in one is on the |
---|
0:14:33 | the front end of a against the door for a performing system |
---|
0:14:39 | on the challenge |
---|
0:14:40 | it will not the to read for the in this challenge is related to the |
---|
0:14:45 | two features |
---|
0:14:47 | and the level of the low end of the front and |
---|
0:14:51 | other people between if the in the v a dynasty the use cochlear filter a |
---|
0:14:58 | cepstral coefficients |
---|
0:14:59 | that are related to the human auditory system |
---|
0:15:02 | possible these something that john it problem |
---|
0:15:10 | so no less and i don't know are most the challenge evaluation on the is |
---|
0:15:16 | v is of two thousand fifteen |
---|
0:15:19 | we proposed a new feature named constant q cepstral coefficients |
---|
0:15:23 | based on the constant q transform which is an alternative to the fourier transform |
---|
0:15:28 | and which employs a variable time-frequency resolution that means |
---|
0:15:34 | greater time resolution for higher frequencies |
---|
0:15:37 | and greater frequency resolution for lower frequencies |
---|
0:15:42 | so that wasn't you the first one vicinity of an idea which are different more |
---|
0:15:46 | closely the human perception |
---|
0:15:49 | and to obtain cqcc features we combine the constant q transform |
---|
0:15:54 | with traditional cepstral analysis |
---|
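The variable time-frequency resolution mentioned here follows from spacing the analysis bins geometrically. The sketch below assumes the classical constant q definition, in which every bin shares the same ratio of centre frequency to bandwidth; the parameter names (`f_min`, `f_max`, `bins_per_octave`) are illustrative and are not the settings used in the talk.

```python
import numpy as np

def constant_q_bins(f_min, f_max, bins_per_octave):
    """Return centre frequencies, bandwidths, window durations and q."""
    n_bins = int(np.floor(bins_per_octave * np.log2(f_max / f_min))) + 1
    k = np.arange(n_bins)
    # Geometrically spaced centre frequencies: each octave holds the same number of bins.
    centre = f_min * 2.0 ** (k / bins_per_octave)
    # Quality factor shared by all bins (centre frequency divided by bandwidth).
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bandwidth = centre / q          # narrow at low frequencies, wide at high ones
    window_seconds = q / centre     # long at low frequencies, short at high ones
    return centre, bandwidth, window_seconds, q
```

Low frequency bins get narrow bandwidths and long windows (fine frequency resolution), while high frequency bins get wide bandwidths and short windows (fine time resolution), in contrast with the fixed resolution of a short-time fourier analysis.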
0:16:02 | i should be for that the only thing started in the challenge |
---|
0:16:07 | where only able to the test then i probably |
---|
0:16:12 | so is it is easy as a |
---|
0:16:15 | obtain completely can be you results for knowing the task and the best results for |
---|
0:16:21 | i do not a week and eighty seven relative improvement on stand |
---|
0:16:26 | and overall seventy two ground control |
---|
0:16:34 | so to summarize basis for fifteen focused on the i don't voice conversion and speech |
---|
0:16:40 | since is a task so not ugly |
---|
0:16:44 | easily disapprovingly detection so no at |
---|
0:16:48 | that's the band the scenario |
---|
0:16:51 | the participant in their invested for to develop features using most simple classifiers |
---|
0:16:59 | and the fourth line regionalisation used in the missing |
---|
0:17:04 | any of |
---|
0:17:06 | i think meet again we the some possible mission improvements |
---|
0:17:18 | i like it doesn't fifteen addition to that used very high quality speech material it'll |
---|
0:17:23 | seventeen addition aims to assess the we have a detection |
---|
0:17:27 | we call in the white |
---|
0:17:29 | condition |
---|
0:17:31 | in focus exclusively on earlier works |
---|
0:17:34 | a second of them i think speaker verification code dimension challenge was presented including this |
---|
0:17:41 | is a special session |
---|
0:17:42 | adding the speech those of indian |
---|
0:17:45 | and fourteen now consider shows a distributed of the challenge |
---|
0:17:52 | cost function if this were from the riesz a text |
---|
0:17:58 | that adults |
---|
0:17:58 | course |
---|
0:18:00 | was proposed was to collect speech lead to over mobile devices |
---|
0:18:05 | in the form of smart phones or a black computers |
---|
0:18:10 | a bible tears of from across to low |
---|
0:18:14 | we collect the a's this will does seven in the database using a playback device |
---|
0:18:20 | and a recording device different acoustic environment |
---|
0:18:27 | we did not to use a realistic scenario using core the recording but we made |
---|
0:18:34 | actually got |
---|
0:18:35 | and do the you don't call me all the target speakers voice |
---|
0:18:40 | to create the plane data collection |
---|
0:18:44 | this is the worst case scenario that of those the use of x sixteen speech |
---|
0:18:50 | were to be linear access |
---|
0:18:56 | the corpus is divided into three subsets for training development and evaluation |
---|
0:19:05 | we different speakers replay section and ugly configuration |
---|
0:19:11 | the training and development subsets were collected at three different sites |
---|
0:19:16 | and the evaluation subset was collected at the same three sites and also includes data |
---|
0:19:23 | from a new site |
---|
0:19:27 | this is the loudest most the inverse italy that |
---|
0:19:34 | in terms of a basically a wider meeting t s for the challenge also here |
---|
0:19:41 | is a clear |
---|
0:19:44 | the this is m is based on the a gmm |
---|
0:19:48 | and the really that's a big effect you |
---|
0:19:52 | with an important case of the equal error rate |
---|
0:19:55 | for all |
---|
0:19:55 | one point eight fifty one point five |
---|
0:19:59 | on these evaluation set |
---|
0:20:04 | the primary evaluation is only whether they can rest of this additional two thousand fifty |
---|
0:20:10 | challenge |
---|
0:20:12 | the equal error rate is computed from scores pooled across all trial segments rather than |
---|
0:20:17 | condition averaging |
---|
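The difference between a pooled equal error rate and a condition-averaged one can be sketched as follows. This is an illustration only: the grouping of spoofed trials into named conditions is a hypothetical data layout, and the helper names are not from the challenge tooling.

```python
import numpy as np

def eer(bonafide, spoof):
    """Equal error rate for a detector where higher scores mean bona fide."""
    bonafide = np.asarray(bonafide, dtype=float)
    spoof = np.asarray(spoof, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    miss = np.array([(bonafide < t).mean() for t in thresholds])
    false_alarm = np.array([(spoof >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return (miss[i] + false_alarm[i]) / 2.0

def pooled_and_averaged_eer(bonafide, spoof_by_condition):
    """Compare one eer over all pooled trials with the mean of per-condition eers."""
    all_spoof = np.concatenate([np.asarray(s, dtype=float)
                                for s in spoof_by_condition.values()])
    pooled = eer(bonafide, all_spoof)
    averaged = float(np.mean([eer(bonafide, s)
                              for s in spoof_by_condition.values()]))
    return pooled, averaged
```

Pooling forces a single threshold to serve every condition at once, so the pooled figure can be noticeably higher than the average of per-condition figures when conditions differ in difficulty.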
0:20:20 | why fourteen estimation |
---|
0:20:22 | perform the baseline while existing three and their the |
---|
0:20:28 | at a performance is the old in more than seven percent relative improvement we used |
---|
0:20:33 | a dismissal a |
---|
0:20:35 | the baseline system is based on a gmm classifier with constant q cepstral coefficient features |
---|
0:20:42 | it was provided to the data |
---|
0:20:45 | comparing the baseline mean zero one thing to do |
---|
0:20:49 | it is important performance improvement when using wondering plus their the three |
---|
0:20:57 | this is this idea of the parameter submission to residuals |
---|
0:21:02 | it doesn't seventy |
---|
0:21:04 | i don't training refer to the bar all the time for training |
---|
0:21:09 | a sense for three and a reasonable |
---|
0:21:14 | most all the systems a lower bound for the features |
---|
0:21:19 | this call mom for all the systems to build a gmm classifier |
---|
0:21:24 | single cost you as you can see |
---|
0:21:27 | the invariant use whatever means of all around solution is twenty five one ninety one |
---|
0:21:33 | understand |
---|
0:21:34 | where s the best single system result show |
---|
0:21:39 | and average detection whatever in |
---|
0:21:41 | or |
---|
0:21:42 | only six point seven percent |
---|
0:21:47 | this is a test tools for looters challenge show that |
---|
0:21:52 | the detection of replay attacks is more difficult than the detection of speech synthesis and voice |
---|
0:21:58 | conversion |
---|
0:22:01 | countermeasure generalisation also remains a problem |
---|
0:22:07 | after the challenge that were that the anomalies |
---|
0:22:10 | i.e. zero-valued samples present at the beginning of some speech utterances |
---|
0:22:17 | is zero really running by for the easy to be a |
---|
0:22:23 | but maybe but i for a modified versions for speech detection |
---|
0:22:29 | to address these issues asvspoof two thousand seventeen version two point zero was released to correct the |
---|
0:22:35 | anomalies |
---|
0:22:37 | i detected of course the evolution |
---|
0:22:39 | in addition metadata which describes the recording and playback devices and the acoustic |
---|
0:22:45 | environments was released along with an updated baseline |
---|
0:22:51 | the new metadata along with the data by ching as there is the number utterances |
---|
0:22:58 | as well as the a population or the evaluation set |
---|
0:23:02 | remember when i'm better than for each other |
---|
0:23:07 | for a better understanding of the outcomes we can rewrite the square the regulation terms |
---|
0:23:13 | of the speaker measurement recording playback devices |
---|
0:23:17 | acoustic environment is a physical spacing which original stage the that basically then here or |
---|
0:23:25 | it is reasonable because seventeen database was collected you have a different environment |
---|
0:23:32 | the evaluation meeting there about the accent level over even more controlled noise |
---|
0:23:38 | the |
---|
0:23:39 | for example can be in we model noise and balcony are assumed to be noisy |
---|
0:23:46 | all these |
---|
0:23:46 | all right are assumed to be maybe which in your oracle room huh |
---|
0:23:53 | are assumed to be are actually |
---|
0:23:58 | there are under the of a twenty six a little better prices |
---|
0:24:02 | a smart phones the lower bound we |
---|
0:24:07 | if we the we fifteen this moral speakers |
---|
0:24:11 | are assumed to be all over the |
---|
0:24:14 | well e |
---|
0:24:15 | a little larger lot of speakers are assumed to be your mean you rightly |
---|
0:24:20 | and the professional or do we managed are assumed to be i |
---|
0:24:27 | assuming only there are a total twenty five recording devices |
---|
0:24:32 | some are ones that are the weights for my from source would be a little |
---|
0:24:36 | windy and it's where a microphone are assumed to be over the medium by i |
---|
0:24:43 | and the again the regression your and b i |
---|
0:24:50 | this figure shows the impact of different replay configurations on asv performance measured in |
---|
0:24:56 | terms of equal error rate |
---|
0:24:58 | we have sent over a zero for impostor trials are replaced with a replaceable by |
---|
0:25:04 | iteratively the each other little degradation |
---|
0:25:09 | the control the demo on the right shows the resulting legal regulations sort of according |
---|
0:25:15 | to the easy equal error rate in the |
---|
0:25:19 | all pole a core also reflect the supposed to be a is the |
---|
0:25:25 | where we are in this a little degradation |
---|
0:25:29 | this is done |
---|
0:25:30 | they higher than one at a very little degradation the motive for effect in a |
---|
0:25:35 | the three years |
---|
0:25:39 | it is this detection performance of a gmm robot |
---|
0:25:44 | and i-vectors read about smoking the dimension |
---|
0:25:48 | for this thing that a little degradation |
---|
0:25:52 | also expressing that all the equal error rate |
---|
0:25:56 | the first edition these results is that the recently the correlation between the specifically to |
---|
0:26:02 | the thing |
---|
0:26:03 | detection or everybody detection or |
---|
0:26:08 | this is a fine reflect the final complex of overwhelmingly device |
---|
0:26:15 | there was to get about a man and the recording right |
---|
0:26:19 | the control on the right a to see the results in terms of the all |
---|
0:26:24 | only a in a environment going back and replay value |
---|
0:26:32 | results show the number of a single element of the little degradation for all i |
---|
0:26:39 | trials this was all we trials corresponding with either one of the |
---|
0:26:46 | i in my all their acoustic environment a system we need the effect of the |
---|
0:26:51 | playback and recording device |
---|
0:26:57 | to summarise asvspoof two thousand seventeen focused on replay |
---|
0:27:02 | so not at a slow was commission |
---|
0:27:05 | performances are reminding |
---|
0:27:07 | even for the worst case scenarios |
---|
0:27:10 | analysis is a very difficult since the data collection was the whole roll |
---|
0:27:17 | remote control data collection mean thing to ensure a which is one recognition or the |
---|
0:27:24 | that is useful to doesn't matter the in |
---|
0:27:27 | so again is related to smoking detection so nicely where |
---|
0:27:32 | text independent scenario will use |
---|
0:27:35 | a there is no gave a database that for a little features and classifiers |
---|
0:27:41 | it generalisation is even missing giving me a |
---|
0:27:45 | it's been mitigated i mean green post evaluation improvement |
---|
0:27:53 | so let's go to the to provide a speaker verification additional information challenge |
---|
0:27:58 | a straightforward on boats |
---|
0:28:00 | speech synthesis and the really |
---|
0:28:09 | as for the because efficient it was examined everything is feasible for special session in |
---|
0:28:16 | their speech goes on a in |
---|
0:28:17 | and forty and fifty organisation there are basically the of the challenge order to standards |
---|
0:28:26 | it is useful because i'm in the in a database is this i would've liked |
---|
0:28:30 | to different use case scenarios |
---|
0:28:32 | well you got and this guy was the score |
---|
0:28:35 | also different a is this strategy of assessing still thing to measure performance on a |
---|
0:28:42 | state |
---|
0:28:42 | instead of the test |
---|
0:28:44 | stand-alone compare measure |
---|
0:28:46 | for this reason for if there is alright we have provided the |
---|
0:28:52 | is this |
---|
0:28:52 | score of the participant |
---|
0:28:55 | so we have adopted as primary metric the minimum normalised tandem detection |
---|
0:29:00 | cost |
---|
0:29:01 | function |
---|
0:29:02 | and as secondary metric the equal error rate |
---|
0:29:06 | also for most discrimination |
---|
0:29:10 | use of the a dcf means that the these this design database is this i'm |
---|
0:29:17 | not for the standard on this task will commercial |
---|
0:29:21 | but they are on the availability in is very system where subject to scooping up |
---|
0:29:34 | it is necessary now to introduce the normalised t-dcf which is inspired by the detection cost function |
---|
0:29:41 | the |
---|
0:29:42 | dcf |
---|
0:29:43 | used in the nist sre challenges |
---|
0:29:47 | i in a this it is |
---|
0:29:51 | aims to assess is the this is the last to make sure |
---|
0:29:55 | to all formalize assessment |
---|
0:29:59 | so long format or by rate |
---|
0:30:02 | or you really motivation for a four |
---|
0:30:09 | okay and the a whole basically |
---|
0:30:14 | countermeasures system |
---|
0:30:17 | there are a total of four possible errors |
---|
0:30:20 | where |
---|
0:30:21 | quantify |
---|
0:30:23 | a target user is rejected by the countermeasure system |
---|
0:30:27 | a bona fide target is rejected by the asv system |
---|
0:30:31 | non-target trials are accepted |
---|
0:30:34 | and spoofing attacks are accepted |
---|
0:30:40 | the for possible errors in be formally describe so it is for the costs and |
---|
0:30:46 | priors are this i mean that one |
---|
0:30:49 | and the classification tree |
---|
0:30:51 | it |
---|
0:30:52 | are computed be taken |
---|
0:30:55 | the roadie dcf a venue a can be difficult to either us or forming the |
---|
0:31:02 | formation of the well in the nist speaker recognition issue |
---|
0:31:08 | it is useful to normalise the cost |
---|
0:31:11 | the normalised t-dcf is a function of the countermeasure threshold |
---|
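The idea of normalising a detection cost and then taking its minimum over the threshold can be sketched with the classical single-system detection cost function. Note that this is not the tandem t-dcf itself, which additionally accounts for the asv system and spoofing priors; the prior and cost values below are placeholders chosen for illustration, not the challenge settings.

```python
import numpy as np

def min_normalised_dcf(target_scores, nontarget_scores,
                       p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Return (minimum normalised dcf, threshold) over all candidate thresholds."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    # Adding +inf covers the "reject everything" operating point.
    thresholds = np.sort(np.concatenate([tar, non, [np.inf]]))
    p_miss = np.array([(tar < t).mean() for t in thresholds])
    p_fa = np.array([(non >= t).mean() for t in thresholds])
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # Cost of the better of the two trivial systems: accept all or reject all.
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    normalised = dcf / default_cost
    idx = np.argmin(normalised)
    return normalised[idx], thresholds[idx]
```

A normalised value of one means the system does no better than a trivial decision, and zero means error-free detection; taking the minimum over thresholds corresponds to assuming ideal calibration.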
0:31:18 | a similar to the bus the challenge efficient |
---|
0:31:22 | is useful for those online dating does not goals of pressure of the set in |
---|
0:31:27 | that means that the calibration |
---|
0:31:30 | so we think source in this case the traditional or mutually the standard measure to |
---|
0:31:34 | install involve a corresponding to go for calibration |
---|
0:31:39 | that correspond to the remaining on remote i |
---|
0:31:43 | in this |
---|
0:31:44 | in by fitting the all my racial the to mine |
---|
0:31:48 | for from the evaluation set using the |
---|
0:31:56 | so the asvspoof two thousand nineteen database is based |
---|
0:32:01 | on the vctk corpus |
---|
0:32:04 | a multi-speaker english speech database recorded in a hemi-anechoic |
---|
0:32:10 | chamber |
---|
0:32:15 | either |
---|
0:32:16 | before weights |
---|
0:32:19 | it was created using utterances from one hundred and seven speakers |
---|
0:32:25 | forty six male and sixty one female |
---|
0:32:27 | the audio is downsampled to sixteen khz at sixteen bits per sample |
---|
0:32:36 | a collection of course uses colour that these in baseball problem in this analysis |
---|
0:32:44 | it is divided in three subsets |
---|
0:32:46 | for training development and evaluation in a speaker disjoint manner |
---|
0:32:52 | for logical access there are six |
---|
0:32:55 | text-to-speech and voice conversion attacks |
---|
0:32:58 | for training and there's fifteen |
---|
0:33:00 | yes and b c score evaluations that |
---|
0:33:05 | what the physical analysis |
---|
0:33:06 | there are then these a holes the |
---|
0:33:09 | environment |
---|
0:33:10 | and i sleepily calculation of training |
---|
0:33:13 | they're an imbalanced |
---|
0:33:17 | we yes |
---|
0:33:18 | the two is then of the double doors to provide state-of-the-art yes this is this |
---|
0:33:24 | if you show a lot of assigning all over the course |
---|
0:33:31 | this table summarize this system which are fundamentally you go first |
---|
0:33:36 | the known |
---|
0:33:37 | small things is the for a zero one at zero six |
---|
0:33:41 | in the lab |
---|
0:33:42 | two v c and four yes systems |
---|
0:33:46 | then |
---|
0:33:46 | well at zero seven to eighty nine d r for a sixteen and even being |
---|
0:33:55 | are the eleven and or something a systems |
---|
0:33:59 | and a sixteen at the eighteen nineteen i don't the reference |
---|
0:34:04 | systems using the same algorithms |
---|
0:34:07 | s |
---|
0:34:07 | at zero four and at zero six |
---|
0:34:11 | the l a verification is the lattice |
---|
0:34:14 | most of our database for speech synthesis and was version is moving the results |
---|
0:34:23 | this is this ensemble of problem a the weather |
---|
0:34:29 | two |
---|
0:34:31 | so |
---|
0:34:37 | we did not complete with any of the local form |
---|
0:34:41 | what if i |
---|
0:34:42 | no |
---|
0:34:43 | the a |
---|
0:34:47 | we did not completely of any of the local phone |
---|
0:34:51 | is you know there speaker one of i |
---|
0:34:55 | employees are entitled to follow that contract to the latter |
---|
0:34:59 | a data |
---|
0:35:02 | employees are entitled followed by a contract so the latter |
---|
0:35:06 | another speaker who finished |
---|
0:35:09 | at that time it's telling faction like and five miles |
---|
0:35:13 | a |
---|
0:35:15 | i at time m is now and faction within five miles |
---|
0:35:20 | as you can hear the quality of the synthesised speech is quite |
---|
0:35:24 | impressive |
---|
0:35:30 | this is the size of a |
---|
0:35:33 | a subset evaluations and session |
---|
0:35:36 | results in terms of a it is for a little baseline we are provided |
---|
0:35:44 | first of all shows the results for two categories of the us to the speech |
---|
0:35:51 | yes we see |
---|
0:35:53 | yes and v c you might |
---|
0:35:56 | and i saw show results for types of models |
---|
0:36:01 | there are neural network based |
---|
0:36:03 | i one |
---|
0:36:05 | a neural network based and where |
---|
0:36:08 | yes |
---|
0:36:09 | neural network based itsy a statistical model based p c |
---|
0:36:14 | last rule |
---|
0:36:16 | shows the results from different with for generation that the |
---|
0:36:22 | in that are |
---|
0:36:24 | neural waveform models, classical speech vocoders |
---|
0:36:28 | waveform concatenation |
---|
0:36:29 | spectral filtering with typically and orders |
---|
0:36:34 | in the testing is the complementary you of your over the baseline |
---|
0:36:39 | otherwise dishonest users you features and the idiot there is a someone else |
---|
0:36:45 | sdc features |
---|
0:36:50 | it doesn't say challenge data was created from the rio your presentation visual quality of |
---|
0:36:56 | the score was somewhat cold or |
---|
0:37:00 | leading to improve upon the last challenge it doesn't line in addition to this once |
---|
0:37:05 | you weighted and all |
---|
0:37:07 | acoustic and global calibration |
---|
0:37:10 | once we use these two similarly enrollment listings and devices we establish right |
---|
0:37:19 | the remainder of this work are similarly directly on that |
---|
0:37:24 | we choose a the one sure on the slide |
---|
0:37:28 | realistic environment winkler only holding the noise putting aside for now the additive noise |
---|
0:37:35 | we really a decision we consider perfect microphones |
---|
0:37:39 | and |
---|
0:37:41 | only at the recording this meeting about a five user |
---|
0:37:47 | and for variability representation |
---|
0:37:51 | we can see the that there are |
---|
0:37:53 | it's carry out that the single session as that used a |
---|
0:37:57 | and will only of the device quite in this case the last speaker |
---|
0:38:07 | the physical access scenario assumes use in it is the leading to convey such as |
---|
0:38:13 | illustrated in fig |
---|
0:38:16 | there was a single iteration which please this is then this it will it is |
---|
0:38:20 | also s |
---|
0:38:22 | is this the data will environment distinction room size or categorize in two different |
---|
0:38:28 | in the remote's label |
---|
0:38:30 | i will rule |
---|
0:38:32 | we may be able |
---|
0:38:33 | and see that actual |
---|
0:38:36 | the position of the ASV system is illustrated by the yellow crossed |
---|
0:38:41 | circle in the figure, whereas the position of the talker is illustrated by the |
---|
0:38:46 | blue star |
---|
0:38:48 | well i assess it is harder |
---|
0:38:51 | maybe by the okay well we'll see change a distance yes for the microphone |
---|
0:39:00 | it is also illustrated in the table environment definition there are three categories or at |
---|
0:39:06 | least and |
---|
0:39:07 | and unlabeled a short distance be making this that and see that at least |
---|
0:39:15 | each physical space system to explain that in addition variability are according to the difference |
---|
0:39:20 | between space |
---|
0:39:22 | which can be seen as a wall ceiling and the for submission coefficients |
---|
0:39:28 | as well as the position interval |
---|
0:39:31 | the level of reverberation is specified in terms of the T60 reverberation |
---|
0:39:37 | time |
---|
0:39:40 | it's fifty whatever item of definition |
---|
0:39:42 | they are the result is six is the u |
---|
0:39:46 | a little i shall we menu and |
---|
0:39:49 | see i recognition |
---|
0:39:52 | it is this is the microphone and that okay or writing reading the visual speech |
---|
0:39:58 | there was a shown are so well |
---|
0:40:02 | we think that although there is an environment as |
---|
0:40:06 | you can see that symbol on the right |
---|
0:40:12 | the man and language for the that's a month it is also illustrated in this |
---|
0:40:17 | paper |
---|
0:40:19 | but something that is modeled by making and then recording over one of five as |
---|
0:40:25 | this |
---|
0:40:26 | and but are sending their according to be is the microphone |
---|
0:40:31 | according are assumed to be made in one over the three zones used to people |
---|
0:40:38 | each representing a different vowel the oldest the problem or |
---|
0:40:45 | in the state in table are a definition if they are labeled character i shows |
---|
0:40:52 | this task of the medium distance and |
---|
0:40:56 | largest |
---|
0:40:58 | in addition to the variation lately we release let us define the means for recording |
---|
0:41:04 | and presentation devices |
---|
0:41:08 | we can see that only the presentation |
---|
0:41:11 | no speaker |
---|
0:41:12 | encoding only and better living in the last speaker if there are four selected |
---|
0:41:19 | we use the categorisation |
---|
0:41:21 | and without any |
---|
0:41:24 | but if there |
---|
0:41:25 | that would be |
---|
0:41:26 | i and it |
---|
0:41:27 | currency one |
---|
0:41:30 | this case we or they have online replaying configuration as you can see and the |
---|
0:41:36 | table |
---|
0:41:37 | on the right |
---|
0:41:40 | the simulation once either two containers all the speakers |
---|
0:41:44 | each with a different range of the whole by about we mean frequency and maybe |
---|
0:41:49 | a linear calibration |
---|
0:41:52 | the first |
---|
0:41:53 | a typical vector category represent the mean dillydallying in full band lot speaker |
---|
0:42:00 | i one last speaker and a megabyte bound we the icsi and units |
---|
0:42:07 | and the being able to more linear or racial a study |
---|
0:42:12 | and one hundred |
---|
0:42:14 | addition |
---|
0:42:15 | and if you're you can see an illustration of set of the higher money frequency |
---|
0:42:21 | responses |
---|
0:42:23 | for i don't be noise model |
---|
0:42:25 | the little device estimated using desynchronized we design a linear system identification |
---|
0:42:33 | based on a linear convolution |
---|
0:42:36 | each one in the finger is the a linear component |
---|
0:42:40 | while from age to if i |
---|
0:42:43 | i the higher wouldn't nonlinear components |
---|
0:42:48 | the blue where the shaded region represent the right boundary |
---|
0:42:57 | is it is still real devices from which measurement where the again for simulation or |
---|
0:43:03 | a clear presentation |
---|
0:43:05 | the first table on the left indicates a multi device is why on the right |
---|
0:43:10 | in the case of interest |
---|
0:43:13 | device that will signifies which type of the magazines |
---|
0:43:17 | what are some all but is a little speaker |
---|
0:43:21 | right most column in the case |
---|
0:43:24 | if the device were used for the simulation of dance in the training and development |
---|
0:43:30 | sets were not devices |
---|
0:43:32 | or evaluations and i don't devices |
---|
0:43:39 | this figure shows again at least commission for the different laws speakers |
---|
0:43:45 | device |
---|
0:43:46 | used for this evaluation |
---|
0:43:49 | the top plot shows a by means of the glottal sure the lower one of |
---|
0:43:54 | the binary but we are the mean and frequency |
---|
0:43:58 | the bottom plot the should ideally a linear calibration |
---|
0:44:02 | in the range of the d |
---|
0:44:04 | or by about |
---|
0:44:07 | devices are sort the wheat the wideband |
---|
0:44:15 | this figure shows baseline results for the PA scenario of the ASVspoof two |
---|
0:44:21 | thousand nineteen database |
---|
0:44:23 | results are used to read and fourteen you important to be in configuration |
---|
0:44:27 | one you acoustic environments |
---|
0:44:29 | and for to monitor a standard on arrays here's something german equal error rate between |
---|
0:44:35 | target and zero for impostor trials that is the blood spatter |
---|
0:44:40 | and target and replaceable from the area they leave are |
---|
0:44:44 | in the middle panel are the standalone replay spoofing detection results in terms of equal error |
---|
0:44:49 | rate |
---|
0:44:51 | for baselines B01 and B02 |
---|
0:44:55 | and in the bottom panel there are the combined ASV and CM results |
---|
0:45:00 | in terms of the min t-DCF |
---|
0:45:03 | e it is yes |
---|
0:45:05 | for this result we guess they the to the is anyone interview medium |
---|
0:45:10 | as for the previous challenges expecting clear |
---|
0:45:14 | and moreover the worst the screens are |
---|
0:45:18 | two or swings high when the device scenes and a little darker to talk be |
---|
0:45:23 | stuff |
---|
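The equal error rate quoted for the baselines is the operating point at which the miss rate and the false alarm rate are equal. A minimal sketch of how it could be estimated from two lists of scores follows. The function name `equal_error_rate` and the convention that higher scores mean "target" are assumptions made for illustration; this is not the challenge's official scoring code.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER (as a fraction) from two lists of scores.

    Assumption: a higher score means "more likely target / bona fide".
    """
    # Use every observed score as a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Miss rate: fraction of target trials scored below the threshold.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm rate: fraction of nontarget trials at or above it.
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer
```

For the ASV results, the nontarget list would hold either zero-effort impostor scores or spoofed-trial scores; for the countermeasure results, it would hold the spoofed-trial scores and the target list the bona fide ones.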
0:45:29 | let's turn now to the challenge results. this figure shows the DET profiles for the baseline |
---|
0:45:36 | system B |
---|
0:45:37 | 02 |
---|
0:45:39 | and the best |
---|
0:45:41 | performing primary system for the LA scenario, T05, |
---|
0:45:46 | and the same team's single system |
---|
0:45:49 | also shown is the second best performing single system for the |
---|
0:45:56 | LA scenario, |
---|
0:45:58 | T45 |
---|
0:45:59 | so the lowest equal error rate is zero point two |
---|
0:46:03 | percent |
---|
0:46:05 | that is a great result |
---|
0:46:08 | however from these results it is clear that there is a substantial gap between |
---|
0:46:14 | primary and single systems |
---|
0:46:17 | a four |
---|
0:46:19 | so this means that fusion is important |
---|
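The talk does not say how the primary systems fuse their subsystems. As one simple illustration of score-level fusion, the sketch below z-normalises each subsystem's scores and takes a weighted average. The names `z_normalise` and `fuse_scores` and the equal default weights are hypothetical choices, not the method of any submitted system.

```python
from statistics import mean, pstdev


def z_normalise(scores):
    """Shift and scale one subsystem's scores to zero mean, unit variance."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # All scores identical: nothing to scale, return zeros.
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


def fuse_scores(per_system_scores, weights=None):
    """Weighted average of z-normalised scores from several subsystems.

    per_system_scores: one list of trial scores per subsystem, all the
    same length and in the same trial order.
    """
    n_systems = len(per_system_scores)
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = [1.0 / n_systems] * n_systems
    normalised = [z_normalise(s) for s in per_system_scores]
    n_trials = len(normalised[0])
    return [
        sum(w * sc[i] for w, sc in zip(weights, normalised))
        for i in range(n_trials)
    ]
```

In practice the weights would be tuned on development data, and the fused scores would then be evaluated with the same metrics as any single system.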
0:46:25 | this slide shows the min t-DCF and equal error rate |
---|
0:46:30 | the results from one before you conditions |
---|
0:46:33 | for the LA scenario |
---|
0:46:36 | the first string below the x-axis denotes whether or not |
---|
0:46:41 | the systems are NN based systems |
---|
0:46:45 | while the second denotes whether or not the systems are ensemble systems |
---|
0:46:50 | which combine multiple |
---|
0:46:52 | subsystems |
---|
0:46:53 | or single systems |
---|
0:46:56 | we can note that there is a majority of NN based |
---|
0:47:01 | and ensemble systems |
---|
0:47:04 | in addition it is also clear that the equal error rate and min t-DCF |
---|
0:47:07 | are measurements that are not correlated |
---|
0:47:12 | as you can see in these two are red and blue |
---|
0:47:19 | this slide shows all the results for the thirteen attacks |
---|
0:47:24 | in the evaluation set for the top ten primary submissions |
---|
0:47:28 | first of all we can see that the baseline ASV equal error rate |
---|
0:47:33 | that means with no spoofing |
---|
0:47:35 | is two point five percent |
---|
0:47:37 | when we include spoofing attacks the ASV system becomes |
---|
0:47:43 | vulnerable |
---|
0:47:45 | again if the individual tax someone else a degree is the performance |
---|
0:47:52 | that are easy to detect |
---|
0:47:54 | there reminding the against you |
---|
0:47:57 | us some degree the easy performance |
---|
0:48:03 | and i difficult to the data they want in the or ranch a physical |
---|
0:48:08 | and there is only one, that is A17 |
---|
0:48:12 | as in this entire on the knees the but is very difficult to detect |
---|
0:48:16 | that is the one in the utterance to scroll |
---|
0:48:22 | so let's move now to the challenge results for PA. these figures show the DET profiles |
---|
0:48:29 | for the baseline system B01 |
---|
0:48:33 | the best performing primary system fourteen d u and the same teams of the systems |
---|
0:48:41 | the lowest equal error rate here is zero point four |
---|
0:48:45 | these are indeed great results |
---|
0:48:50 | was it to invade |
---|
0:48:52 | here there is less discrepancy between primary and single systems |
---|
0:48:57 | so fusion seems to be not so important |
---|
0:49:03 | this is my shoulder while the mean dcf decoder ring the results for one if |
---|
0:49:08 | is shown that to the each and you |
---|
0:49:12 | and anything point as before on the x-axis denote a unit based in the nn |
---|
0:49:18 | three or and channel and |
---|
0:49:20 | the known in some other systems |
---|
0:49:23 | note that, as for LA, |
---|
0:49:26 | there is a majority of NN based and ensemble systems |
---|
0:49:37 | it is like a this on or on the results for all the nine a |
---|
0:49:42 | single evaluation set for the door then primary submission |
---|
0:49:47 | and we can see that the baseline is the query |
---|
0:49:51 | well seems keys needs solos moving is he going for example for stack |
---|
0:49:57 | when we in class looking at a this is then |
---|
0:50:01 | because |
---|
0:50:01 | wouldn't it |
---|
0:50:03 | so looking at these i |
---|
0:50:06 | we can see that the performance increases |
---|
0:50:10 | where |
---|
0:50:12 | the distance back to okay becomes greater |
---|
0:50:15 | so there are very fancy one |
---|
0:50:18 | and decreases when the quality of the device gets better |
---|
0:50:22 | so real routes suitable |
---|
0:50:29 | it is nice on all of the silence now four or other than twenty seven |
---|
0:50:34 | that environments and evaluation sets again for the a the parameter estimation |
---|
0:50:41 | so looking at least and over individual environments we can see that the performance is |
---|
0:50:46 | the graces where the room i recall |
---|
0:50:50 | really |
---|
0:50:51 | so the received go |
---|
0:50:53 | in case is when they are very the given variational model because higher |
---|
0:50:58 | c |
---|
0:50:58 | and increase when the to go to easily distance becomes higher |
---|
0:51:04 | getting |
---|
0:51:05 | see what |
---|
0:51:09 | so to summarise a system that doesn't like being focus on the |
---|
0:51:14 | but eagerly and yes or voice conversion |
---|
0:51:20 | a simple even if one would be evaluated |
---|
0:51:23 | we have shown that ASV systems are vulnerable |
---|
0:51:29 | to spoofing attacks |
---|
0:51:32 | we have defined and limiting the dcf was just moving on to measure performance on |
---|
0:51:38 | a c d |
---|
0:51:39 | so instead of a doing these the on the standard on one dimensional |
---|
0:51:45 | we have seen a transition from features to classifiers so and unit order to into |
---|
0:51:53 | and that |
---|
0:51:54 | and one double the fused system with the biggest challenges |
---|
0:52:00 | don't demand countermeasures are very |
---|
0:52:03 | how to the speech sounds are |
---|
0:52:08 | very natural |
---|
0:52:10 | is the recognition accuracy very clear by detection again be proven to work this time |
---|
0:52:16 | of by only and stage |
---|
0:52:20 | generalization is still missing |
---|
0:52:22 | much more has to be done |
---|
0:52:26 | so i don't to the union a and for decision |
---|
0:52:30 | ASVspoof two thousand |
---|
0:52:32 | twenty one |
---|
0:52:38 | so but for finnish thing to do not i like to wish to some softer |
---|
0:52:42 | each for speaker recognition grunting using from us at all |
---|
0:52:50 | it appears to keep the from a is |
---|
0:52:54 | and my results to identically to overcome my as well |
---|
0:52:58 | currently silently to from the university |
---|
0:53:03 | you can finally two databases for easy and the disposing |
---|
0:53:10 | i thought winter the is additional database misleading |
---|
0:53:13 | and nist and the are star burst in that the speaker recognition database |
---|
0:53:23 | a right don't |
---|
0:53:24 | and the text dependent speaker recognition database |
---|
0:53:29 | we also the a e for it is simply a database from |
---|
0:53:36 | and the speaker wire a new speech and boxers |
---|
0:53:45 | so here you can find some of the for this thing |
---|
0:53:49 | matlab implementation of training and the scope of this common conditions |
---|
0:53:54 | this is used as you features |
---|
0:53:56 | and the three these coding systems that an easy to a last challenge |
---|
0:54:07 | you know website you can find the matlab client on implementation of the teens yes |
---|
0:54:15 | and the in your with the regarding the is a you please easy the a |
---|
0:54:22 | one website |
---|
0:54:23 | we need you are cool |
---|
0:54:27 | last time at least i like to shoot due to budget |
---|
0:54:31 | where i'm the principal investigator region two d measurement recognition also not only speech |
---|
0:54:39 | a disapproving and |
---|
0:54:40 | closing phase information |
---|
0:54:43 | classifiers and respect |
---|
0:54:45 | thus nazis ultimately increase the number eighty three networks |
---|
0:54:49 | and the domain instruments increment because representing volume i mean and he uttering networks |
---|
0:54:56 | and the second, RESPECT, |
---|
0:54:58 | is a French-German project |
---|
0:55:01 | and its complete name is reliable, secure and privacy-preserving multi-biometric person authentication |
---|
0:55:11 | thank you for listening and see you at the Q&A session |
---|